Abstract

We consider the problem of multi-task reinforcement learning (MTRL) in multiple
partially observable stochastic environments. We introduce the regionalized policy
representation (RPR) to characterize the agent’s behavior in each environment. The RPR is a
parametric model of the conditional distribution over current actions given the history of
past actions and observations; the agent’s choice of actions is directly based on this
conditional distribution, without an intervening model to characterize the environment itself.

We  propose  off-policy  batch  algorithms  to  learn  the  parameters  of  the  RPRs,  using  episodic 
data  collected  when  following  a  behavior  policy,  and  show  their  linkage  to  policy  iteration. 

We  employ  the  Dirichlet  process  as  a  nonparametric  prior  over  the  RPRs  across  multiple 
environments.  The  intrinsic  clustering  property  of  the  Dirichlet  process  imposes  sharing 
of  episodes  among  similar  environments,  which  effectively  reduces  the  number  of  episodes 
required  for  learning  a  good  policy  in  each  environment,  when  data  sharing  is  appropriate. 

The number of distinct RPRs and the associated clusters (the sharing patterns) are
automatically discovered by exploiting the episodic data as well as the nonparametric nature of
the Dirichlet process. We demonstrate the effectiveness of the proposed RPR as well as the
RPR-based MTRL framework on various problems, including grid-world navigation and
multi-aspect target classification. The experimental results show that the RPR is a
competitive reinforcement learning algorithm in partially observable domains, and the MTRL
consistently achieves better performance than single-task reinforcement learning.

1.  Introduction 

Planning in a partially observable stochastic environment has been studied extensively in
the fields of operations research and artificial intelligence. Traditional methods are based on
partially observable Markov decision processes (POMDPs) and assume that the POMDP
models are given (Sondik 1971; Smallwood and Sondik 1973). Many POMDP planning
algorithms (Sondik 1971, 1978; Cheng 1988; Lovejoy 1991; Hansen 1997; Kaelbling et al.
1998; Poupart and Boutilier 2003; Pineau et al. 2003; Spaan and Vlassis 2005; Smith and
Simmons 2005; Li et al. 2006a,b) have been proposed, addressing problems of increasing
complexity as the algorithms become progressively more efficient. However, the assumption
of knowing the underlying POMDP model is often difficult to meet in practice. In many
cases the only knowledge available to the agent is its experience, i.e., the observations and
rewards resulting from interactions with the environment, and the agent must learn the
behavior policy based on such experience. This problem is known as reinforcement learning
(RL) (Sutton and Barto 1998). Reinforcement learning methods generally fall into two
broad categories: model-based and model-free. In model-based methods, one first builds
a POMDP model based on experiences and then exploits the existing planning algorithms
to find the POMDP policy. In model-free methods, one directly infers the policy based
on experiences. The focus of this report is on the latter, trying to find the policy for a
partially observable stochastic environment without the intervening stage of
environment-model learning.

In model-based approaches, when the model is updated based on new experiences
gathered from the agent-environment interaction, one has to solve a new POMDP planning
problem. Solving a POMDP is computationally expensive, particularly when one
takes into account the model uncertainty; in the latter case the POMDP state space grows
fast, often making it inefficient to find even an approximate solution (Wang et al. 2005).
Recent work (Ross et al. 2008) gives a relatively efficient approximate model-based method,
but still the computation time grows exponentially with the planning horizon. By contrast,
model-free methods update the policy directly, without the need to update an intervening
POMDP model, thus saving time and eliminating the errors introduced by approximations
that may be made when solving the POMDP.

Model-based methods suffer particular computational inefficiency in multi-task
reinforcement learning (MTRL), the problem being investigated in this report, because one has
to repeatedly solve multiple POMDPs due to frequent experience-updating arising from the
communications among different RL tasks. The work in (Wilson et al. 2007) assumes the
environment states are perfectly observable, reducing the POMDP in each task to a Markov
decision process (MDP); since an MDP is relatively efficient to solve, the computational issue
is not serious there. In the present report, we assume the environment states are partially
observable, thus manifesting a POMDP associated with each environment. If model-based
methods are pursued, one would have to solve multiple POMDPs for each update of the
task clusters, which entails a prohibitive computational burden.

Model-free  methods  are  consequently  particularly  advantageous  for  MTRL  in  partially 
observable  domains.  The  regionalized  policy  representation  (RPR)  proposed  in  this  report, 
which  yields  an  efficient  parametrization  for  the  policy  governing  the  agent’s  behavior  in 
each  environment,  lends  itself  naturally  to  a  Bayesian  formulation  and  thus  furnishes  a 
posterior  distribution  of  the  policy.  The  policy  posterior  allows  the  agent  to  reason  and 
plan  under  uncertainty  about  the  policy  itself.  Since  the  ultimate  goal  of  reinforcement 
learning  is  the  policy,  the  policy’s  uncertainty  is  more  direct  and  relevant  to  the  learning 
goal than the POMDP model’s uncertainty as considered in (Ross et al. 2008).

The MTRL problem considered in this report shares its motivation with the work
in (Wilson et al. 2007), that is, in many real-world settings there may be multiple
environments for which policies are desired. For example, a single agent may have collected
experiences from previous environments and wishes to borrow from previous experience
when learning the policy for a new environment. In another case, multiple agents are
distributed in multiple environments, and they wish to communicate with each other and
share experiences such that their respective performances are enhanced. In either case the
experiences in one environment should be properly exploited to benefit the learning in
another (Guestrin et al. 2003). Appropriate experience sharing among multiple environments
and joint learning of multiple policies save resources, improve policy quality, and enhance
generalization to new environments, especially when the experiences from each individual
environment are scarce (Thrun 1996). Many problems in practice can be formulated as
an MTRL problem, with one example given in (Wilson et al. 2007). The application we
consider in the experiments is another example, in which we make the
more realistic assumption that the states of the environments are partially observable.

To date there has been much work addressing the problem of inferring the sharing
structure between general learning tasks. Most of the work follows a hierarchical Bayesian
approach, which assumes that the parameters (models) for each task are sampled from a
common prior distribution, such as a Gaussian distribution specified by unknown
hyper-parameters (Lawrence and Platt 2004; Yu et al. 2003). The parameters as well as the
hyper-parameters are estimated simultaneously in the learning phase. In (Bakker and Heskes 2003)
a single Gaussian prior is extended to a Gaussian mixture; each task is given a corresponding
Gaussian prior and related tasks are allowed to share a common Gaussian prior. Such a
formulation for information sharing is more flexible than a single common prior, but still has
limitations: the form of the prior distribution must be specified a priori, and the number
of mixture components must also be pre-specified.

In  the  MTRL  framework  developed  in  this  report,  we  adopt  a  nonparametric  approach 
by employing the Dirichlet process (DP) (Ferguson 1973) as our prior, extending the work
in (Yu et al. 2004; Xue et al. 2007) to model-free policy learning. The nonparametric
DP  prior  does  not  assume  a  specific  form,  therefore  it  offers  a  rich  representation  that 
captures  complicated  sharing  patterns  among  various  tasks.  A  nonparametric  prior  drawn 
from  the  DP  is  almost  surely  discrete,  and  therefore  a  prior  distribution  that  is  drawn 
from  a  DP  encourages  task-dependent  parameter  clustering.  The  tasks  in  the  same  cluster 
share  information  and  are  learned  collectively  as  a  group.  The  resulting  MTRL  framework 
automatically  learns  the  number  of  clusters,  the  members  in  each  cluster  as  well  as  the 
associated  common  policy. 

The nonparametric DP prior has been used previously in MTRL (Wilson et al. 2007),
where each task is a Markov decision process (MDP) assuming perfect state observability.
To the authors’ knowledge, this report represents the first attempt to apply the DP prior
to reinforcement learning in multiple partially observable stochastic environments. Another
distinction is that the method here is model-free, with information sharing performed
directly at the policy level, without having to learn a POMDP model first; the method in
(Wilson et al. 2007) is based on using MDP models.

This report contains several technical contributions. We propose the regionalized policy
representation (RPR) as an efficient parametrization of stochastic policies in the absence
of a POMDP model, and develop techniques of learning the RPR parameters based on
maximizing the sum of discounted rewards accrued during episodic interactions with the
environment. An analysis of the techniques is provided, and relations are established to the
expectation-maximization algorithm and the POMDP policy improvement theorem. We
formulate the MTRL framework by placing multiple RPRs in a Bayesian setting and employ
a draw from the Dirichlet process as their common nonparametric prior. The Dirichlet
process posterior is derived, based on a nonconventional application of Bayes’ law. Because the
DP posterior involves large mixtures, Gibbs sampling analysis is inefficient. This motivates
a hybrid Gibbs-variational algorithm to learn the DP posterior. The proposed techniques
are evaluated on four problem domains, including the benchmark Hallway2 (Littman et al.
1995), its multi-task variants, and a remote sensing application. The main theoretical
results in the report are summarized in the form of theorems and lemmas, the proofs of which
are all given in the Appendix.

The RPR formulation in this report is an extension of the work in (Li 2006; Liao et al.
2007). All other content in the report is extended from the work in (Li 2006).

2.  Partially  Observable  Markov  Decision  Processes 

The partially observable Markov decision process (POMDP) (Sondik 1971; Lovejoy 1991;
Kaelbling et al. 1998) is a mathematical model for the optimal control of an agent situated
in a partially observable stochastic environment. In a POMDP the state dynamics of the
agent are governed by a Markov process, and the state of the process is not completely
observable but is inferred from observations; the observations are probabilistically related
to the state. Formally, the POMDP can be described as a tuple $(S, A, T, O, \Omega, R)$, where
$S$, $A$, $O$ respectively denote a finite set of states, actions, and observations; $T$ are
state-transition matrices with $T_{ss'}(a)$ the probability of transiting to state $s'$ by taking action
$a$ in state $s$; $\Omega$ are observation functions with $\Omega_{s'o}(a)$ the probability of observing $o$ after
performing action $a$ and transiting to state $s'$; and $R$ is a reward function with $R(s,a)$ the
expected immediate reward received by taking action $a$ in state $s$.

The  optimal  control  of  a  POMDP  is  represented  by  a  policy  for  choosing  the  best 
action  at  any  time  such  that  the  future  expected  reward  is  maximized.  Since  the  state 
in  a  POMDP  is  only  partially  observable,  the  action  choice  is  based  on  the  belief  state,  a 
sufficient  statistic  defined  as  the  probability  distribution  of  the  state  s  given  the  history  of 
actions and observations (Sondik 1971). It is important to note that computation of the
belief  state  requires  knowing  the  underlying  POMDP  model. 

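The belief state constitutes a continuous-state Markov process (Smallwood and Sondik
1973). Given that at time $t$ the belief state is $b$ and the action $a$ is taken, and the observation
received at time $t+1$ is $o$, then the belief state at time $t+1$ is computed by Bayes rule

$$
b_o^a(s') = \frac{\sum_{s \in S} b(s)\, T_{ss'}(a)\, \Omega_{s'o}(a)}{p(o \mid b, a)} \tag{1}
$$

where $T$ and $\Omega$ are the state-transition and observation probabilities of the POMDP, the
superscript $a$ and the subscript $o$ are used to indicate the dependence of the new
belief state on $a$ and $o$, and

$$
p(o \mid b, a) = \sum_{s' \in S} \sum_{s \in S} b(s)\, T_{ss'}(a)\, \Omega_{s'o}(a) \tag{2}
$$

is the probability of transiting from $b$ to $b' = b_o^a$ when taking action $a$.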
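
As a concrete illustration of the Bayes update in (1) and its normalizing constant in (2), the following minimal Python sketch computes the new belief state for a tabular POMDP. The function name `belief_update`, the nested-list layouts `T[a][s][s2]` and `Omega[a][s2][o]`, and the two-state toy model are assumptions made for this example only; they do not come from the report.

```python
def belief_update(b, a, o, T, Omega):
    """Return (b_next, p_o) for belief b, action a, and observation o.

    Assumed (hypothetical) layouts:
      T[a][s][s2]     -- probability of moving from state s to s2 under action a
      Omega[a][s2][o] -- probability of observing o after reaching s2 via action a
    """
    n = len(b)
    # Numerator of (1) for every s': sum_s b(s) T_{ss'}(a) Omega_{s'o}(a)
    unnorm = [
        Omega[a][s2][o] * sum(b[s] * T[a][s][s2] for s in range(n))
        for s2 in range(n)
    ]
    # Equation (2): p(o | b, a) is the sum of the numerators over s'
    p_o = sum(unnorm)
    if p_o == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [u / p_o for u in unnorm], p_o


# Toy model (illustrative values): two states, one action, two observations.
T_TOY = [[[0.9, 0.1], [0.2, 0.8]]]
OMEGA_TOY = [[[0.8, 0.2], [0.3, 0.7]]]

if __name__ == "__main__":
    b_next, p_o = belief_update([0.5, 0.5], 0, 0, T_TOY, OMEGA_TOY)
    print(b_next, p_o)
```

Note that the update needs both `T` and `Omega`, which is why the belief state cannot be computed when the POMDP model is unknown.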

Equations (1) and (2) imply that, for any POMDP, there exists a corresponding Markov
decision process (MDP), the state of which coincides with the belief state of the POMDP
(hence the term “belief-state MDP”). Although the belief state is continuous, its transition
probabilities are discrete: from any given $b$, one can only make a transition to a finite
number of new belief states $\{b_o^a : a \in A,\ o \in O\}$, assuming $A$ and $O$ are discrete sets with
finite alphabets. For any action $a \in A$, the belief state transition probabilities are given by

$$
p(b' \mid b, a) =
\begin{cases}
p(o \mid b, a), & \text{if } b' = b_o^a \\
0, & \text{otherwise}
\end{cases}
\tag{3}
$$

The expected reward of the belief-state MDP is given by

$$
R(b, a) = \sum_{s \in S} b(s)\, R(s, a) \tag{4}
$$

In summary, the belief-state MDP is completely defined by the action set $A$, the space of
belief states

$$
B = \Big\{ b \in \mathbb{R}^{|S|} : b(s) \ge 0,\ \sum_{s \in S} b(s) = 1 \Big\}
$$

along with the belief state transition probabilities in (3) and the reward function in (4).

The optimal control of the POMDP can be found by solving the corresponding
belief-state MDP. Assuming that at any time there are infinite steps remaining for the POMDP
(infinite horizon), that the future rewards are discounted exponentially with a factor $0 < \gamma < 1$,
and that the action is drawn from $p^{\Pi}(a \mid b)$, the expected reward accumulated over the
infinite horizon satisfies the Bellman equation (Bellman 1957; Smallwood and Sondik 1973)

$$
V^{\Pi}(b) = \sum_{a \in A} p^{\Pi}(a \mid b) \Big[ R(b, a) + \gamma \sum_{o \in O} p(o \mid b, a)\, V^{\Pi}(b_o^a) \Big] \tag{5}
$$


where $V^{\Pi}(b)$ is called the value function. Sondik (1978) showed that, for a finite-transient
deterministic policy (see footnote 1), there exists a Markov partition $B = B_1 \cup B_2 \cup \cdots$ satisfying the
following two properties:

(a) There is a unique optimal action $a_i$ associated with subset $B_i$, $i = 1, 2, \cdots$. This
implies that the optimal control is represented by a deterministic mapping from the
Markov partition to the set of actions.

(b) Each subset maps completely into another (or itself), i.e., $\{b_o^a : b \in B_i,\ a = \Pi(b),\ o \in O\} \subseteq B_j$
($i$ may equal $j$).

The Markov partition yields an equivalent representation of the finite-transient deterministic
policy. Sondik noted that an arbitrary policy $\Pi$ is not likely to be finite-transient, and for it
one can only construct a partition where one subset maps partially into another (or itself),
i.e., there exists $b \in B_i$ and $o \in O$ such that $b_o^{\Pi(b)} \notin B_j$. Nevertheless, the Markov partition
provides an approximate representation for non-finite-transient policies and Sondik gave
an error bound of the difference between the true value function and approximate value
function obtained by the Markov partition. Based on the Markov partition, Sondik also
proposed a policy iteration algorithm for POMDPs, which was later improved by Hansen
(1997); the improved algorithm is referred to as the finite state controller (the partition is
finite).

1. Let $\Pi$ be a deterministic policy, i.e., $p^{\Pi}(a \mid b) = 1$ if $a = \Pi(b)$ and $0$ otherwise. Let
$S_{\Pi}^{n}$ be the set of all possible belief-states when $\Pi$ has been followed for $n$ consecutive
steps by starting from any initial belief-state. The $\Pi$ is finite transient if and only if there
exists $n < \infty$ such that $S_{\Pi}^{n}$ is disjoint with $\{b : \Pi(b) \text{ is discontinuous at } b\}$ (Sondik 1978).

3. Regionalized Policy Representation

We are interested in model-free policy learning, i.e., we assume the model of the POMDP
is unknown and aim to learn the policy directly from the experiences (data) collected from
agent-environment interactions. One may argue that we do in fact learn a model, but our
model is directly at the policy level, constituting a probabilistic mapping from the space of
action-observation histories to the action space.

Although  the  optimal  control  of  a  POMDP  can  be  obtained  via  solving  the  corresponding 
belief-state  MDP,  this  is  not  true  when  we  lack  an  underlying  POMDP  model.  This  is 
because,  as  indicated  above,  the  observability  of  the  belief-state  depends  on  the  availability 
of  the  POMDP  model.  When  the  model  is  unknown,  one  does  not  have  access  to  the 
information  required  to  compute  the  belief  state,  making  the  belief  state  unobservable. 

In  this  report,  we  treat  the  belief-state  as  a  hidden  (latent)  variable  and  marginalize  it 
out  to  yield  a  stochastic  POMDP  policy  that  is  purely  dependent  on  the  observable  history, 
i.e.,  the  sequence  of  previous  actions  and  observations.  The  belief-state  dynamics,  as  well  as 
the  optimal  control  in  each  state,  is  learned  empirically  from  experiences,  instead  of  being 
computed  from  an  underlying  POMDP  model.  Although  it  may  be  possible  to  learn  the 
dynamics  and  control  in  the  continuous  space  of  belief  state,  the  exposition  in  this  report 
is  restricted  to  the  discrete  case,  i.e.,  the  case  for  which  the  continuous  belief-state  space  is 
quantized into a finite set of disjoint regions. The quantization can be viewed as a stochastic
counterpart of the Markov partition (Sondik 1978), discussed at the end of Section 2. With
the quantization, we learn the dynamics of belief regions and the local optimal control in
each region, both represented stochastically. The stochasticity manifests the uncertainty
arising from the belief quantization (the policy is parameterized in terms of latent belief
regions, not the precise belief state). The stochastic policy reduces to a deterministic one
when  the  policy  is  finitely  transient,  in  which  case  the  quantization  becomes  a  Markov 
partition.  The  resulting  framework  is  termed  regionalized  policy  representation  to  reflect 
the  fact  that  the  policy  of  action  selection  is  expressed  through  the  dynamics  of  belief  regions 
as  well  as  the  local  controls  in  each  region.  We  also  use  decision  state  as  a  synonym  of  belief 
region,  in  recognition  of  the  fact  that  each  belief  region  is  an  elementary  unit  to  encode  the 
decisions  of  action  selection. 

3.1  Formal  Framework 

Definition 1. A regionalized policy representation (RPR) is a tuple $(A, O, Z, W, \mu, \pi)$
specified as follows. The $A$ and $O$ are respectively a finite set of actions and observations. The $Z$
is a finite set of decision states (belief regions). The $W$ are decision-state transition matrices
with $W(z, a, o', z')$ denoting the probability of transiting from $z$ to $z'$ when taking action
$a$ in decision state $z$ results in observing $o'$. The $\mu$ is the initial distribution of decision
states with $\mu(z)$ denoting the probability of initially being in decision state $z$. The $\pi$ are
state-dependent stochastic policies with $\pi(z, a)$ denoting the probability of taking action $a$
in decision state $z$.


The stochastic formulation of $W$ and $\pi$ in Definition 1 is fairly general and subsumes
two special cases.

1. If $z$ shrinks down to a single belief-state $b$, $z = b$ becomes a sufficient statistic of the
POMDP (Smallwood and Sondik 1973) and there is a unique action associated with
it, thus $\pi(z, a)$ is deterministic and the local policy can be simplified as $a = \pi(b)$.

2. If the belief regions form a Markov partition of the belief-state space (Sondik 1978),
i.e., $B = \bigcup_{z \in Z} B_z$, then the action choice in each region is constant and one region
transits completely to another (or itself). In this case, both $W$ and $\pi$ are deterministic
and, moreover, the policy yielded by the RPR (see (9)) is finite transient deterministic.
In fact this is the same case as considered in (Hansen 1997).

In both of the two special cases, each $z$ has one action choice $a = \pi(z)$ associated with
it, and one can write $W(z, a, o', z') = W(z, \pi(z), o', z')$, thus the transition of $z$ is driven
solely by $o$. In general, each $z$ represents multiple individual belief-states, and the belief
region transition is driven jointly by $a$ and $o$. The action-dependency captures the state
dynamics of the POMDP, and the observation-dependency reflects the partial observability
of the state (perceptual aliasing).

To make notation simple, the following conventions are observed throughout the report:

• The elements of $A$ are enumerated as $A = \{1, 2, \cdots, |A|\}$, where $|A|$ denotes the
cardinality of $A$. Similarly, $O = \{1, 2, \cdots, |O|\}$ and $Z = \{1, 2, \cdots, |Z|\}$.

• A sequence of actions $(a_0, a_1, \cdots, a_T)$ is abbreviated as $a_{0:T}$, where the subscripts
index discrete time steps. Similarly a sequence of observations $(o_1, o_2, \cdots, o_T)$ is
abbreviated as $o_{1:T}$, and a sequence of decision states $(z_0, z_1, \cdots, z_T)$ is abbreviated
as $z_{0:T}$, etc.

• A history $h_t$ is the set of actions executed and observations received up to time step $t$,
i.e., $h_t = \{a_{0:t-1}, o_{1:t}\}$.

Let $\Theta = \{\pi, \mu, W\}$ denote the parameters of the RPR. Given a history of actions and
observations, $h_t = (a_{0:t-1}, o_{1:t})$, collected up to time step $t$, the RPR yields a joint probability
distribution of $z_{0:t}$ and $a_{0:t}$

$$
p(a_{0:t}, z_{0:t} \mid o_{1:t}, \Theta) = \mu(z_0)\, \pi(z_0, a_0) \prod_{\tau=1}^{t} W(z_{\tau-1}, a_{\tau-1}, o_\tau, z_\tau)\, \pi(z_\tau, a_\tau) \tag{6}
$$

where application of local controls $\pi(z_t, a_t)$ at every time step implies that $a_{0:t}$ are all drawn
according to the RPR. The decision states $z_{0:t}$ in (6) are hidden variables and we marginalize
them to get

$$
p(a_{0:t} \mid o_{1:t}, \Theta) = \sum_{z_0, \cdots, z_t = 1}^{|Z|} \Big[ \mu(z_0)\, \pi(z_0, a_0) \prod_{\tau=1}^{t} W(z_{\tau-1}, a_{\tau-1}, o_\tau, z_\tau)\, \pi(z_\tau, a_\tau) \Big] \tag{7}
$$
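
The sum in (7) runs over $|Z|^{t+1}$ decision-state paths, but the chain structure of (6) allows it to be evaluated with a forward recursion of the kind used for hidden Markov models. The sketch below is an illustration under stated assumptions, not the report's algorithm: the function names, the nested-list layouts `mu[z]`, `pi[z][a]`, `W[z][a][o][z2]`, and the toy parameter values are all hypothetical. It compares brute-force enumeration of (7) with the recursion.

```python
from itertools import product


def joint_prob(actions, z_seq, obs, mu, pi, W):
    """Equation (6) for one decision-state path z_seq.

    obs[tau - 1] holds o_tau, so len(obs) == len(actions) - 1.
    """
    p = mu[z_seq[0]] * pi[z_seq[0]][actions[0]]
    for tau in range(1, len(actions)):
        p *= W[z_seq[tau - 1]][actions[tau - 1]][obs[tau - 1]][z_seq[tau]]
        p *= pi[z_seq[tau]][actions[tau]]
    return p


def marginal_brute_force(actions, obs, mu, pi, W):
    """Equation (7) by explicit enumeration of all decision-state paths."""
    nz = len(mu)
    return sum(
        joint_prob(actions, zs, obs, mu, pi, W)
        for zs in product(range(nz), repeat=len(actions))
    )


def marginal_forward(actions, obs, mu, pi, W):
    """Equation (7) by a forward recursion, linear in the sequence length."""
    nz = len(mu)
    # alpha[z] = p(a_{0:tau}, z_tau = z | o_{1:tau}, Theta)
    alpha = [mu[z] * pi[z][actions[0]] for z in range(nz)]
    for tau in range(1, len(actions)):
        a_prev, o = actions[tau - 1], obs[tau - 1]
        alpha = [
            pi[z2][actions[tau]]
            * sum(alpha[z] * W[z][a_prev][o][z2] for z in range(nz))
            for z2 in range(nz)
        ]
    return sum(alpha)


# Toy RPR (illustrative values): 2 decision states, 2 actions, 2 observations.
MU = [0.6, 0.4]
PI = [[0.7, 0.3], [0.2, 0.8]]
W_EX = [
    [[[0.9, 0.1], [0.5, 0.5]], [[0.3, 0.7], [0.6, 0.4]]],
    [[[0.2, 0.8], [0.4, 0.6]], [[0.1, 0.9], [0.7, 0.3]]],
]

if __name__ == "__main__":
    acts, obs = [0, 1, 1], [1, 0]
    print(marginal_brute_force(acts, obs, MU, PI, W_EX))
    print(marginal_forward(acts, obs, MU, PI, W_EX))
```

Both functions return the same value up to floating-point error; only the recursion remains practical for long episodes.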


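It follows from (7) that

$$
\begin{aligned}
p(a_{0:t-1} \mid o_{1:t}, \Theta)
&= \sum_{a_t = 1}^{|A|} p(a_{0:t} \mid o_{1:t}, \Theta) \\
&= \sum_{z_0, \cdots, z_{t-1} = 1}^{|Z|} \Big[ \mu(z_0)\, \pi(z_0, a_0) \prod_{\tau=1}^{t-1} W(z_{\tau-1}, a_{\tau-1}, o_\tau, z_\tau)\, \pi(z_\tau, a_\tau) \Big]
\times \underbrace{\sum_{a_t = 1}^{|A|} \sum_{z_t = 1}^{|Z|} W(z_{t-1}, a_{t-1}, o_t, z_t)\, \pi(z_t, a_t)}_{=\,1} \\
&= p(a_{0:t-1} \mid o_{1:t-1}, \Theta)
\end{aligned}
\tag{8}
$$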


which implies that observation $o_t$ does not influence the actions before $t$, in agreement with
expectations. From (7) and (8), we can write the history-dependent distribution of action
choices

$$
p(a_\tau \mid h_\tau, \Theta)
= p(a_\tau \mid a_{0:\tau-1}, o_{1:\tau}, \Theta)
= \frac{p(a_{0:\tau} \mid o_{1:\tau}, \Theta)}{p(a_{0:\tau-1} \mid o_{1:\tau}, \Theta)}
= \frac{p(a_{0:\tau} \mid o_{1:\tau}, \Theta)}{p(a_{0:\tau-1} \mid o_{1:\tau-1}, \Theta)}
\tag{9}
$$

which gives a stochastic RPR policy for choosing the action $a_\tau$, given the historical actions
and observations. The policy is purely history-dependent, with the unobservable belief
regions $z$ integrated out.
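
One way to evaluate the ratio in (9) without recomputing both marginals at every step is to carry a normalized distribution over decision states along the history and mix the local controls with it. The sketch below follows that idea; the recursion is a standard filtering rearrangement of (7) and (9), and the function names, data layouts (`mu[z]`, `pi[z][a]`, `W[z][a][o][z2]`), and toy parameters are assumptions for illustration rather than the report's implementation.

```python
def action_distribution(hist_actions, hist_obs, mu, pi, W):
    """Return [p(a | h_t, Theta) for each action a], as in equation (9).

    hist_actions = [a_0, ..., a_{t-1}] and hist_obs = [o_1, ..., o_t].
    """
    nz, na = len(mu), len(pi[0])
    # beta[z] = p(z_tau = z | h_tau, Theta); for the empty history it equals mu.
    beta = list(mu)
    for a, o in zip(hist_actions, hist_obs):
        unnorm = [
            sum(beta[z] * pi[z][a] * W[z][a][o][z2] for z in range(nz))
            for z2 in range(nz)
        ]
        total = sum(unnorm)
        if total == 0.0:
            raise ValueError("history has zero probability under the RPR")
        beta = [u / total for u in unnorm]
    # Mix the local controls pi(z, a) with the decision-state distribution.
    return [sum(beta[z] * pi[z][a] for z in range(nz)) for a in range(na)]


def chain_rule_prob(actions, obs, mu, pi, W):
    """Product over tau of p(a_tau | h_tau, Theta), with h_0 empty."""
    p = 1.0
    for tau, a_tau in enumerate(actions):
        p *= action_distribution(actions[:tau], obs[:tau], mu, pi, W)[a_tau]
    return p


# Toy RPR (illustrative values): 2 decision states, 2 actions, 2 observations.
MU = [0.6, 0.4]
PI = [[0.7, 0.3], [0.2, 0.8]]
W_EX = [
    [[[0.9, 0.1], [0.5, 0.5]], [[0.3, 0.7], [0.6, 0.4]]],
    [[[0.2, 0.8], [0.4, 0.6]], [[0.1, 0.9], [0.7, 0.3]]],
]

if __name__ == "__main__":
    print(action_distribution([0], [1], MU, PI, W_EX))
    print(chain_rule_prob([0, 1, 1], [1, 0], MU, PI, W_EX))
```

The helper `chain_rule_prob` multiplies the per-step probabilities; for the toy parameters it agrees with a direct evaluation of the marginal in (7).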

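The history $h_t$ forms a Markov process with transitions driven by actions and
observations: $h_t = h_{t-1} \cup \{a_{t-1}, o_t\}$. Applying this recursively, we get
$h_t = \bigcup_{\tau=1}^{t} \{a_{\tau-1}, o_\tau\}$, and therefore

$$
\begin{aligned}
\prod_{\tau=0}^{t} p(a_\tau \mid h_\tau, \Theta)
&= \Big[ \prod_{\tau=0}^{t-2} p(a_\tau \mid h_\tau, \Theta) \Big]\, p(a_{t-1} \mid h_{t-1}, \Theta)\, p(a_t \mid h_{t-1}, a_{t-1}, o_t, \Theta) \\
&= \Big[ \prod_{\tau=0}^{t-2} p(a_\tau \mid h_\tau, \Theta) \Big]\, p(a_{t-1:t} \mid h_{t-1}, o_t, \Theta) \\
&= \Big[ \prod_{\tau=0}^{t-3} p(a_\tau \mid h_\tau, \Theta) \Big]\, p(a_{t-2} \mid h_{t-2}, \Theta)\, p(a_{t-1:t} \mid h_{t-2}, a_{t-2}, o_{t-1}, o_t, \Theta) \\
&= \Big[ \prod_{\tau=0}^{t-3} p(a_\tau \mid h_\tau, \Theta) \Big]\, p(a_{t-2:t} \mid h_{t-2}, o_{t-1:t}, \Theta) \\
&\;\;\vdots \\
&= p(a_{0:t} \mid h_0, o_{1:t}, \Theta) \\
&= p(a_{0:t} \mid o_{1:t}, \Theta)
\end{aligned}
\tag{10}
$$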

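where we have used $p(a_\tau \mid h_\tau, o_{\tau+1:t}) = p(a_\tau \mid h_\tau)$ and $h_0 = \text{null}$. The rightmost side of (10)
is the observation-conditional probability of joint action-selection at multiple time steps
$\tau = 0, 1, \cdots, t$. Equation (10) can be verified directly by multiplying (9) over $\tau = 0, 1, \cdots, t$

$$
\begin{aligned}
\prod_{\tau=0}^{t} p(a_\tau \mid h_\tau, \Theta)
&= p(a_0 \mid \Theta)\,
\frac{p(a_{0:1} \mid o_1, \Theta)}{p(a_0 \mid \Theta)}\,
\frac{p(a_{0:2} \mid o_{1:2}, \Theta)}{p(a_{0:1} \mid o_1, \Theta)}
\cdots
\frac{p(a_{0:t-1} \mid o_{1:t-1}, \Theta)}{p(a_{0:t-2} \mid o_{1:t-2}, \Theta)}\,
\frac{p(a_{0:t} \mid o_{1:t}, \Theta)}{p(a_{0:t-1} \mid o_{1:t-1}, \Theta)} \\
&= p(a_{0:t} \mid o_{1:t}, \Theta)
\end{aligned}
\tag{11}
$$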


It is of interest to point out the difference between the RPR and previous reinforcement
learning algorithms for POMDPs. The reactive policy and history truncation (Jaakkola
et al. 1995; Baxter and Bartlett 2001) condition the action only upon the immediate
observation or a truncated sequence of observations, without using the full history, and therefore
these are clearly different from the RPR. The U-tree (McCallum 1995) stores historical
information along the branches of decision trees, with the branches split to improve the
prediction of future return or utility. The drawback is that the tree may grow intolerably
fast with the episode length. The finite policy graphs (Meuleau et al. 1999), finite state
controllers (Aberdeen and Baxter 2002), and utile distinction HMMs (Wierstra and Wiering
2004) use internal states to memorize the full history; however, their state transitions
are driven by observations only. In contrast, the dynamics of decision states in the RPR
are driven jointly by actions and observations, the former capturing the dynamics of
world-states and the latter reflecting the perceptual aliasing. Moreover, none of the previous
algorithms is based on Bayesian learning, and therefore they are intrinsically not amenable
to the Dirichlet process framework that is used in the RPR for multi-task examples.


3.2  The  Learning  Objective 

We  are  interested  in  empirical  learning  of  the  RPR,  based  on  a  set  of  episodes  defined  as 
follows. 

Definition 2. (Episode) An episode is a sequence of agent-environment interactions
terminated in an absorbing state that transits to itself with zero rewards (Sutton and Barto
1998). An episode is denoted by $(a_0^k r_0^k o_1^k a_1^k r_1^k \cdots o_{T_k}^k a_{T_k}^k r_{T_k}^k)$, where the subscripts are
discrete times, $k$ indexes the episodes, and $o$, $a$, and $r$ are respectively observations, actions,
and immediate rewards.

Definition 3. (Sub-episode) A sub-episode is an episode truncated at a particular time
step and retaining the immediate reward only at the time step where truncation occurs.
The $t$-th sub-episode of episode $(a_0^k r_0^k o_1^k a_1^k r_1^k \cdots o_{T_k}^k a_{T_k}^k r_{T_k}^k)$ is defined as
$(a_0^k o_1^k a_1^k \cdots o_t^k a_t^k r_t^k)$, which yields a total of $T_k + 1$ sub-episodes for this episode.
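
To make Definition 3 concrete, the short sketch below enumerates the sub-episodes of a single episode. Storing an episode as three parallel lists (`actions`, `observations`, `rewards`) and the function name `sub_episodes` are assumptions made for this illustration.

```python
def sub_episodes(actions, observations, rewards):
    """List the sub-episodes of one episode, following Definition 3.

    actions      = [a_0, ..., a_T]
    observations = [o_1, ..., o_T]   (one fewer than actions)
    rewards      = [r_0, ..., r_T]

    The t-th sub-episode keeps the actions up to a_t, the observations up to
    o_t, and only the immediate reward r_t at the truncation step.
    """
    if not (len(actions) == len(rewards) == len(observations) + 1):
        raise ValueError("expected T+1 actions, T+1 rewards, T observations")
    return [
        (actions[: t + 1], observations[:t], rewards[t])
        for t in range(len(actions))
    ]


if __name__ == "__main__":
    for sub in sub_episodes([0, 1, 0], [2, 3], [0.0, 0.0, 1.0]):
        print(sub)
```

An episode with final time index $T$ therefore contributes $T + 1$ sub-episodes, one per time step.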

The learning objective is to maximize the optimality criterion given in Definition 4.
Theorem 5 introduced below establishes the limit of the criterion when the number of
episodes approaches infinity.

Definition 4. (The RPR Optimality Criterion) Let
$\mathcal{D}^{(K)} = \{(a_0^k r_0^k o_1^k a_1^k r_1^k \cdots o_{T_k}^k a_{T_k}^k r_{T_k}^k)\}_{k=1}^{K}$
be a set of episodes obtained by an agent interacting with the environment by following
policy $\Pi$ to select actions, where $\Pi$ is an arbitrary stochastic policy with action-selecting
distributions $p^{\Pi}(a_t \mid h_t) > 0$, $\forall$ action $a_t$, $\forall$ history $h_t$. The RPR optimality criterion is
defined as

$$
\widehat{V}(\mathcal{D}^{(K)}; \Theta) \overset{\text{def}}{=} \frac{1}{K} \sum_{k=1}^{K} \sum_{t=0}^{T_k}
\frac{\gamma^{t} r_t^k}{\prod_{\tau=0}^{t} p^{\Pi}(a_\tau^k \mid h_\tau^k)}
\prod_{\tau=0}^{t} p(a_\tau^k \mid h_\tau^k, \Theta)
\tag{12}
$$

where $h_t^k = a_0^k o_1^k a_1^k \cdots o_t^k$ is the history of actions and observations up to time $t$ in the $k$-th
episode, $0 < \gamma < 1$ is the discount, and $\Theta$ denotes the parameters of the RPR.

Theorem 5. Let $\widehat{V}(\mathcal{D}^{(K)}; \Theta)$ be as defined in Definition 4, then $\lim_{K \to \infty} \widehat{V}(\mathcal{D}^{(K)}; \Theta)$ is the
expected sum of discounted rewards within the environment under test by following the RPR
policy parameterized by $\Theta$, over an infinite horizon.

Theorem 5 shows that the optimality criterion given in Definition 4 is the expected
sum of discounted rewards in the limit, when the number of episodes approaches infinity.
Throughout the report, we call $\lim_{K \to \infty} \widehat{V}(\mathcal{D}^{(K)}; \Theta)$ the value function and $\widehat{V}(\mathcal{D}^{(K)}; \Theta)$ the
empirical value function. The $\Theta$ maximizing the (empirical) value function is the best RPR
policy (given the episodes).

It is assumed in Theorem 5 that the behavior policy $\Pi$ used to collect the episodic data is an arbitrary policy that assigns nonzero probability to any action given any history, i.e., $\Pi$ is required to be a soft policy (Sutton and Barto, 1998). This premise assures a complete exploration of the actions that might lead to large immediate rewards given any history, i.e., the actions that might be selected by the optimal policy.

4. Single-Task Reinforcement Learning (STRL)

We develop techniques to maximize the empirical value function in (12); the $\Theta$ resulting from value maximization is called a Maximum-Value (MV) estimate (related to maximum likelihood). An MV estimate of the RPR is preferred when the number of episodes is large, in which case the empirical value function approaches the true value function and the estimate is expected to approach the optimal (assuming the algorithm is not trapped in a local maximum). The episodes are assumed to have been collected in a single partially observable stochastic environment, which may correspond to a single physical environment or a pool of multiple identical/similar physical environments. As a result, the techniques developed in this section are for single-task reinforcement learning (STRL).

By substituting the earlier RPR expressions for the action probabilities $p(a_\tau^k|h_\tau^k,\Theta)$ into (12), we rewrite the empirical value function,

$$
\widehat{V}(\mathcal{D}^{(K)};\Theta) = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k} \tilde{r}_t^k \sum_{z_0^k,\cdots,z_t^k=1}^{|\mathcal{Z}|} p(a_{0:t}^k, z_{0:t}^k|o_{1:t}^k,\Theta) \tag{13}
$$

where

$$
\tilde{r}_t^k = \frac{\gamma^t r_t^k}{\prod_{\tau=0}^{t} p^{\Pi}(a_\tau^k|h_\tau^k)} \tag{14}
$$

is the discounted immediate reward weighted by the inverse probability that the behavior policy $\Pi$ has generated $r_t^k$. The weighting is a result of importance sampling (Robert and Casella, 1999), and reflects the fact that $r_t^k$ is obtained by following $\Pi$ but the Monte Carlo integral (i.e., the empirical value function) is with respect to the RPR policy $\Theta$. For simplicity, $\tilde{r}_t^k$ is also referred to as discounted immediate reward or simply reward throughout the report.
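The weighting in (14) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `weighted_rewards`, the array layout, and the example episode are hypothetical and not part of the original report.

```python
import numpy as np

def weighted_rewards(rewards, behavior_probs, gamma):
    """Importance-weighted discounted rewards of one episode, as in (14).

    rewards[t]        : immediate reward r_t
    behavior_probs[t] : p^Pi(a_t | h_t), assumed strictly positive (soft policy)
    """
    rewards = np.asarray(rewards, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    # cumprod[t] is the probability that the behavior policy generated a_0..a_t
    return discounts * rewards / np.cumprod(behavior_probs)

# Hypothetical episode: three steps, uniform two-action behavior policy
r_tilde = weighted_rewards([1.0, 1.0, 2.0], [0.5, 0.5, 0.5], gamma=0.9)
```

Rewards that occur late in an episode are divided by a longer product of behavior probabilities, so unlikely action sequences that still earned reward receive large weights.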

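We assume $r_t > 0$ (and hence $\tilde{r}_t > 0$), which can always be achieved by adding a constant to $r_t$; this results in a constant added to the value function (the value function of a POMDP is linear in immediate reward) and does not influence the policy.

Theorem 6. (Maximum Value Estimation) Let

$$
q_t^k(z_{0:t}^k|\Theta^{(n)}) = \frac{\tilde{r}_t^k}{\widehat{V}(\mathcal{D}^{(K)};\Theta^{(n)})}\, p(a_{0:t}^k, z_{0:t}^k|o_{1:t}^k,\Theta^{(n)}) \tag{15}
$$

for $z_t^k = 1, 2, \cdots, |\mathcal{Z}|$, $t = 1, 2, \cdots, T_k$, and $k = 1, 2, \cdots, K$. Let

$$
\Theta^{(n+1)} = \arg\max_{\widehat{\Theta}\in\mathcal{F}} \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\cdots,z_t^k=1}^{|\mathcal{Z}|} q_t^k(z_{0:t}^k|\Theta^{(n)}) \ln\frac{\tilde{r}_t^k\, p(a_{0:t}^k, z_{0:t}^k|o_{1:t}^k,\widehat{\Theta})}{q_t^k(z_{0:t}^k|\Theta^{(n)})} \tag{16}
$$

where

$$
\mathcal{F} = \Big\{\widehat{\Theta} = (\hat{\mu},\hat{\pi},\widehat{W}) : \sum_{j=1}^{|\mathcal{Z}|}\hat{\mu}(j) = 1,\ \sum_{a=1}^{|\mathcal{A}|}\hat{\pi}(i,a) = 1,\ \sum_{j=1}^{|\mathcal{Z}|}\widehat{W}(i,a,o,j) = 1,\ i = 1,2,\cdots,|\mathcal{Z}|,\ a = 1,2,\cdots,|\mathcal{A}|,\ o = 1,2,\cdots,|\mathcal{O}|\Big\} \tag{17}
$$

is the set of feasible parameters for the RPR in question. Let $\{\Theta^{(0)}\Theta^{(1)}\cdots\Theta^{(n)}\cdots\}$ be a sequence yielded by iteratively applying (15) and (16), starting from $\Theta^{(0)}$. Then

$$
\lim_{n\to\infty}\widehat{V}(\mathcal{D}^{(K)};\Theta^{(n)})
$$

exists and the limit is a maximum of $\widehat{V}(\mathcal{D}^{(K)};\Theta)$.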

To gain a better understanding of Theorem 6, we rewrite (15) to get

$$
q_t^k(z_{0:t}^k|\Theta) = \frac{\sigma_t^k(\Theta)}{\widehat{V}(\mathcal{D}^{(K)};\Theta)}\, p(z_{0:t}^k|a_{0:t}^k, o_{1:t}^k,\Theta) \tag{18}
$$

where $p(z_{0:t}^k|a_{0:t}^k, o_{1:t}^k,\Theta)$ is a standard posterior distribution of the latent decision states given the $\Theta$ updated in the most recent iteration (the superscript $(n)$ indicating the iteration number has been dropped for simplicity), and

$$
\sigma_t^k(\Theta) \overset{\text{def}}{=} \tilde{r}_t^k\, p(a_{0:t}^k|o_{1:t}^k,\Theta) \tag{19}
$$

is called the re-computed reward at time step $t$ in the $k$-th episode. The re-computed reward represents the discounted immediate reward weighted by the probability that the action sequence yielding this reward is generated by the RPR policy parameterized by $\Theta$; therefore $\sigma_t^k(\Theta)$ is a function of $\Theta$. The re-computed reward reflects the update of the RPR policy which, if allowed to re-interact with the environment, is expected to accrue larger rewards than in the previous iteration. Recall that the algorithm does not assume real re-interactions with the environment, so the episodes themselves cannot be updated. However, by recomputing the rewards as in (19), the agent is allowed to generate an internal set of episodes in which the immediate rewards are modified. The internal episodes represent the new episodes that would be collected if the agent followed the updated RPR to really re-interact with the environment. In this sense, the reward re-computation can be thought of as virtual re-interactions with the environment.

By (18), $q_t^k(z_{0:t}^k)$ is a weighted version of the standard posterior of $z_{0:t}^k$, with the weight given by the reward recomputed by the RPR in the previous iteration. The normalization constant $\widehat{V}(\mathcal{D}^{(K)};\Theta)$, which is also the empirical value function in (13), can be expressed as the recomputed rewards averaged over all episodes at all time steps,

$$
\widehat{V}(\mathcal{D}^{(K)};\Theta) = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sigma_t^k(\Theta) \tag{20}
$$

which ensures

$$
\frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\cdots,z_t^k=1}^{|\mathcal{Z}|} q_t^k(z_{0:t}^k|\Theta) = 1 \tag{21}
$$

The maximum value (MV) algorithm based on alternately applying (15) and (16) in Theorem 6 bears strong resemblance to the expectation-maximization (EM) algorithms (Dempster et al., 1977) widely used in statistics, with (15) and (16) respectively corresponding to the E-step and M-step in EM. However, the goal in standard EM algorithms is to maximize a likelihood function, while the goal of the MV algorithm is to maximize an empirical value function. This causes significant differences between the MV and the EM. It is helpful to compare the MV algorithm in Theorem 6 to the EM algorithm for maximum likelihood (ML) estimation in hidden Markov models (Rabiner, 1989), since both deal with sequences or episodes. The sequences in an HMM are treated as uniformly important, therefore parameter updating is based solely on the frequency of occurrences of latent states. Here the episodes are not equally important because they have different rewards associated with them, which determine their importance relative to each other. As seen in (18), the posterior of $z_{0:t}^k$ is weighted by the recomputed reward $\sigma_t^k$, which means that the contribution of episode $k$ (at time $t$) to the update of $\Theta$ is not solely based on the frequency of occurrences of $z_{0:t}^k$ but also based on the associated $\sigma_t^k$. Thus the new parameters $\widehat{\Theta}$ will be adjusted in such a way that the episodes earning large rewards have more “credits” recorded into $\widehat{\Theta}$ and, as a result, the policy parameterized by $\widehat{\Theta}$ will more likely generate actions that lead to high rewards.

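The objective function being maximized in (16) enjoys some interesting properties due to the fact that $q_t^k(z_{0:t}^k)$ is a weighted posterior of $z_{0:t}^k$. These properties not only establish a more formal connection between the MV algorithm here and the traditional ML algorithm based on EM, they also shed light on the close relations between Theorem 6 and the policy improvement theorem of POMDP (Blackwell, 1965). To show these properties, we rewrite the objective function in (16), in simplified notation, as

$$
\begin{aligned}
\mathrm{LB}(\widehat{\Theta}|\Theta) &\overset{\text{def}}{=} \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\cdots,z_t^k=1}^{|\mathcal{Z}|} q_t^k(z_{0:t}^k|\Theta)\ln\frac{\tilde{r}_t^k\, p(a_{0:t}^k, z_{0:t}^k|o_{1:t}^k,\widehat{\Theta})}{q_t^k(z_{0:t}^k|\Theta)} \\
&= \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\frac{\sigma_t^k(\Theta)}{\widehat{V}(\mathcal{D}^{(K)};\Theta)}\sum_{z_0^k,\cdots,z_t^k=1}^{|\mathcal{Z}|} p(z_{0:t}^k|a_{0:t}^k,o_{1:t}^k,\Theta)\ln\frac{\tilde{r}_t^k\, p(a_{0:t}^k, z_{0:t}^k|o_{1:t}^k,\widehat{\Theta})}{\frac{\sigma_t^k(\Theta)}{\widehat{V}(\mathcal{D}^{(K)};\Theta)}\, p(z_{0:t}^k|a_{0:t}^k,o_{1:t}^k,\Theta)}
\end{aligned} \tag{22}
$$

where the second equation is obtained by substituting (18) into the left side of it. Since $\frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\frac{\sigma_t^k(\Theta)}{\widehat{V}(\mathcal{D}^{(K)};\Theta)} = 1$ and $\sum_{z_0^k,\cdots,z_t^k=1}^{|\mathcal{Z}|} p(z_{0:t}^k|a_{0:t}^k,o_{1:t}^k,\Theta) = 1$, one can apply Jensen’s inequality twice to the rightmost side of (22) to obtain two inequalities

$$
\begin{aligned}
\mathrm{LB}(\widehat{\Theta}|\Theta) &\le \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\frac{\sigma_t^k(\Theta)}{\widehat{V}(\mathcal{D}^{(K)};\Theta)}\ln\frac{\tilde{r}_t^k\, p(a_{0:t}^k|o_{1:t}^k,\widehat{\Theta})}{\sigma_t^k(\Theta)/\widehat{V}(\mathcal{D}^{(K)};\Theta)} \\
&= \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\frac{\sigma_t^k(\Theta)}{\widehat{V}(\mathcal{D}^{(K)};\Theta)}\ln\frac{\sigma_t^k(\widehat{\Theta})}{\sigma_t^k(\Theta)/\widehat{V}(\mathcal{D}^{(K)};\Theta)} \overset{\text{def}}{=} \Upsilon(\widehat{\Theta}|\Theta) \\
&\le \ln\frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sigma_t^k(\widehat{\Theta}) = \ln\widehat{V}(\mathcal{D}^{(K)};\widehat{\Theta})
\end{aligned} \tag{23}
$$

where the first inequality is with respect to $p(z_{0:t}^k|a_{0:t}^k,o_{1:t}^k,\Theta)$ while the second inequality is with respect to $\big\{\frac{\sigma_t^k(\Theta)}{\widehat{V}(\mathcal{D}^{(K)};\Theta)} : t = 1,\cdots,T_k,\ k = 1,\cdots,K\big\}$. Each inequality yields a lower bound to the logarithmic empirical value function $\ln\widehat{V}(\mathcal{D}^{(K)};\widehat{\Theta})$. It is not difficult to verify from (22) and (23) that both of the two lower bounds are tight (the respective equality can be reached), i.e.,

$$
\mathrm{LB}(\Theta|\Theta) = \ln\widehat{V}(\mathcal{D}^{(K)};\Theta) = \Upsilon(\Theta|\Theta) \tag{24}
$$

The equations in (24) along with the inequalities in (23) show that any $\widehat{\Theta}$ satisfying $\mathrm{LB}(\Theta|\Theta) < \mathrm{LB}(\widehat{\Theta}|\Theta)$ or $\Upsilon(\Theta|\Theta) < \Upsilon(\widehat{\Theta}|\Theta)$ also satisfies $\widehat{V}(\mathcal{D}^{(K)};\Theta) < \widehat{V}(\mathcal{D}^{(K)};\widehat{\Theta})$. Thus one can choose to maximize either of the two lower bounds, $\mathrm{LB}(\widehat{\Theta}|\Theta)$ or $\Upsilon(\widehat{\Theta}|\Theta)$, when trying to improve the empirical value of $\widehat{\Theta}$ over that of $\Theta$. In either case, the maximization is with respect to $\widehat{\Theta}$.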
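The second lower bound in (23) and its tightness in (24) can be checked numerically. The sketch below assumes the recomputed rewards of all sub-episodes have been flattened into one positive array; the function names and the random test values are hypothetical and only serve the illustration.

```python
import numpy as np

def empirical_value(sigma, num_episodes):
    """(1/K) * sum of recomputed rewards over all episodes and time steps, as in (20)."""
    return float(np.sum(sigma)) / num_episodes

def upsilon(sigma_old, sigma_new, num_episodes):
    """Second lower bound of (23), written in terms of recomputed rewards.

    sigma_old : recomputed rewards under the current parameters (all > 0)
    sigma_new : recomputed rewards under the candidate parameters (all > 0)
    """
    weights = sigma_old / empirical_value(sigma_old, num_episodes)
    return float(np.sum(weights * np.log(sigma_new / weights))) / num_episodes

# Hypothetical recomputed rewards: K = 4 episodes, 12 sub-episodes in total
rng = np.random.default_rng(0)
K = 4
sigma_old = rng.uniform(0.1, 2.0, size=12)
sigma_new = rng.uniform(0.1, 2.0, size=12)

bound = upsilon(sigma_old, sigma_new, K)              # never exceeds the log value below
log_value_new = np.log(empirical_value(sigma_new, K))
tight = upsilon(sigma_old, sigma_old, K)              # equals the log value at sigma_old
```

Because the bound touches the log empirical value at the current parameters, any candidate that raises the bound also raises the empirical value.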

The two alternatives, though both yielding an improved RPR, are quite different in the manner the improvement is achieved. Suppose one has obtained $\Theta^{(n)}$ by applying (15) and (16) for $n$ iterations, and is seeking $\Theta^{(n+1)}$ satisfying $\widehat{V}(\mathcal{D}^{(K)};\Theta^{(n)}) < \widehat{V}(\mathcal{D}^{(K)};\Theta^{(n+1)})$. Maximization of the first lower bound gives $\Theta^{(n+1)} = \arg\max_{\widehat{\Theta}\in\mathcal{F}}\mathrm{LB}(\widehat{\Theta}|\Theta^{(n)})$, which has an analytic solution that will be given in Section 4.2. Maximization of the second lower bound yields

$$
\Theta^{(n+1)} = \arg\max_{\widehat{\Theta}\in\mathcal{F}}\Upsilon(\widehat{\Theta}|\Theta^{(n)}) \tag{25}
$$

The definition of $\Upsilon$ in (23) is substituted into (25) to yield

$$
\begin{aligned}
\Theta^{(n+1)} &= \arg\max_{\widehat{\Theta}\in\mathcal{F}}\frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\frac{\sigma_t^k(\Theta^{(n)})}{\widehat{V}(\mathcal{D}^{(K)};\Theta^{(n)})}\ln\frac{\sigma_t^k(\widehat{\Theta})}{\sigma_t^k(\Theta^{(n)})/\widehat{V}(\mathcal{D}^{(K)};\Theta^{(n)})} \\
&= \arg\max_{\widehat{\Theta}\in\mathcal{F}}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sigma_t^k(\Theta^{(n)})\ln p(a_{0:t}^k|o_{1:t}^k,\widehat{\Theta})
\end{aligned} \tag{26}
$$

which shows that maximization of the second lower bound is equivalent to maximizing a weighted sum of the log-likelihoods of $\{a_{0:t}^k\}$, with the weights being the rewards recomputed by $\Theta^{(n)}$. Through (26), the connection between the maximum value algorithm in Theorem 6 and the traditional ML algorithm is made more formal and clearer: with the recomputed rewards given and fixed, the MV algorithm is a weighted version of the ML algorithm, with $\Upsilon(\widehat{\Theta}|\Theta^{(n)})$ a weighted log-likelihood function of $\widehat{\Theta}$.

The above analysis also sheds light on the relations between Theorem 6 and the policy improvement theorem in POMDP (Blackwell, 1965). By (23), (24), and (26), we have

$$
\ln\widehat{V}(\mathcal{D}^{(K)};\Theta^{(n)}) = \Upsilon(\Theta^{(n)}|\Theta^{(n)}) \le \Upsilon(\Theta^{(n+1)}|\Theta^{(n)}) \le \ln\widehat{V}(\mathcal{D}^{(K)};\Theta^{(n+1)}) \tag{27}
$$

The first inequality, achieved by the weighted likelihood maximization in (26), represents the policy improvement on the old episodes collected by following the previous policy. The second inequality ensures that, if the improved policy is followed to collect new episodes in the environment, the expected sum of newly accrued rewards is no less than that obtained by following the previous policy. This is similar to policy evaluation. Note that the update of episodes is simulated by reward computation. The actual episodes are collected by a fixed behavior policy $\Pi$ and do not change.

The maximization in (26) can be performed using any optimization technique. As long as the maximization is achieved, the policy is improved as guaranteed by Theorem 6. Since the latent $z$ variables are involved, it is natural to employ EM to solve the maximization. The EM solution to (26) is obtained by solving a sequence of maximization problems: starting from $\Theta^{(n)(0)} = \Theta^{(n)}$, one successively solves

$$
\Theta^{(n)(j)} = \arg\max_{\widehat{\Theta}\in\mathcal{F}}\mathrm{LB}(\widehat{\Theta}|\Theta^{(n)(j-1)}) \quad \text{subject to } \sigma_t^k(\Theta^{(n)(j-1)}) = \sigma_t^k(\Theta^{(n)})\ \ \forall\, t, k, \qquad j = 1, 2, \cdots \tag{28}
$$

where in each problem one maximizes the first lower bound with an updated posterior of $\{z_t^k\}$ but with the recomputed rewards fixed at $\{\sigma_t^k(\Theta^{(n)})\}$; upon convergence, the solution of (28) is the solution to (26). The EM solution here is almost the same as the likelihood maximization of sequences for hidden Markov models (Rabiner, 1989). The only difference is that here we have a weighted log-likelihood function, but with the weights given and fixed. The posterior of $\{z_t^k\}$ can be updated by employing dynamic programming techniques similar to those used in HMM, as we discuss below.

It is interesting to note that, with standard EM employed to solve (26), the overall maximum value algorithm is a “double-EM” algorithm, since reward computation constitutes an outer EM-like loop.

4.1 Calculating the Posterior of Latent Belief Regions

To allocate the weights or recomputed rewards and update the RPR as described above, we do not need to know the full distribution of $z_{0:t}^k$. Instead, only a small set of marginals of $p(z_{0:t}^k|a_{0:t}^k,o_{1:t}^k,\Theta)$ is necessary for the purpose, in particular,

$$
\xi_{t,\tau}^k(i,j) = p(z_\tau^k = i, z_{\tau+1}^k = j\,|\,a_{0:t}^k,o_{1:t}^k,\Theta) \tag{29}
$$

$$
\phi_{t,\tau}^k(i) = p(z_\tau^k = i\,|\,a_{0:t}^k,o_{1:t}^k,\Theta) \tag{30}
$$

Lemma 7. (Factorization of the $\xi$ and $\phi$ Variables) Let

$$
\alpha_\tau^k(i) = p(z_\tau^k = i\,|\,a_{0:\tau}^k,o_{1:\tau}^k,\Theta) = \frac{p(z_\tau^k = i, a_{0:\tau}^k\,|\,o_{1:\tau}^k,\Theta)}{\prod_{\tau'=0}^{\tau} p(a_{\tau'}^k|h_{\tau'}^k,\Theta)} \tag{31}
$$

$$
\beta_{t,\tau}^k(i) = \frac{p(a_{\tau+1:t}^k\,|\,z_\tau^k = i, a_\tau^k, o_{\tau+1:t}^k,\Theta)}{\prod_{\tau'=\tau}^{t} p(a_{\tau'}^k|h_{\tau'}^k,\Theta)} \tag{32}
$$

Then

$$
\xi_{t,\tau}^k(i,j) = \alpha_\tau^k(i)\, W(z_\tau^k = i, a_\tau^k, o_{\tau+1}^k, z_{\tau+1}^k = j)\, \pi(z_{\tau+1}^k = j, a_{\tau+1}^k)\, \beta_{t,\tau+1}^k(j) \tag{33}
$$

$$
\phi_{t,\tau}^k(i) = \alpha_\tau^k(i)\, \beta_{t,\tau}^k(i)\, p(a_\tau^k|h_\tau^k,\Theta) \tag{34}
$$

The $\alpha$ and $\beta$ variables in Lemma 7 are similar to the scaled forward variables and backward variables in hidden Markov models (HMM) (Rabiner, 1989). The scaling factors here are $\prod_{\tau'=0}^{\tau} p(a_{\tau'}^k|h_{\tau'}^k,\Theta)$, which is equal to $p(a_{0:\tau}^k|o_{1:\tau}^k,\Theta)$ as shown earlier. Recall from Definition 3 that one episode of length $T$ has $T + 1$ sub-episodes, each having a different ending time step. For this reason, one must compute the $\beta$ variables for each sub-episode separately, since the $\beta$ variables depend on the ending time step. The $\alpha$ variables need to be computed only once per episode, since they do not involve the ending time step.

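Similar to the forward variables and backward variables in HMM models, the $\alpha$ and $\beta$ variables can be computed recursively, via dynamic programming,

$$
\alpha_\tau^k(i) = \begin{cases}
\dfrac{\mu(z_0^k = i)\,\pi(z_0^k = i, a_0^k)}{p(a_0^k|h_0^k,\Theta)}, & \tau = 0 \\[2ex]
\dfrac{\sum_{j=1}^{|\mathcal{Z}|}\alpha_{\tau-1}^k(j)\, W(z_{\tau-1}^k = j, a_{\tau-1}^k, o_\tau^k, z_\tau^k = i)\,\pi(z_\tau^k = i, a_\tau^k)}{p(a_\tau^k|h_\tau^k,\Theta)}, & \tau > 0
\end{cases} \tag{35}
$$

$$
\beta_{t,\tau}^k(i) = \begin{cases}
\dfrac{1}{p(a_t^k|h_t^k,\Theta)}, & \tau = t \\[2ex]
\dfrac{\sum_{j=1}^{|\mathcal{Z}|} W(z_\tau^k = i, a_\tau^k, o_{\tau+1}^k, z_{\tau+1}^k = j)\,\pi(z_{\tau+1}^k = j, a_{\tau+1}^k)\,\beta_{t,\tau+1}^k(j)}{p(a_\tau^k|h_\tau^k,\Theta)}, & \tau < t
\end{cases} \tag{36}
$$

for $t = 0,\cdots,T_k$ and $k = 1,\cdots,K$. Since $\sum_{i=1}^{|\mathcal{Z}|}\alpha_\tau^k(i) = 1$, it follows from (35) that

$$
p(a_\tau^k|h_\tau^k,\Theta) = \begin{cases}
\sum_{i=1}^{|\mathcal{Z}|}\mu(z_0^k = i)\,\pi(z_0^k = i, a_0^k), & \tau = 0 \\[1ex]
\sum_{i=1}^{|\mathcal{Z}|}\sum_{j=1}^{|\mathcal{Z}|}\alpha_{\tau-1}^k(j)\, W(z_{\tau-1}^k = j, a_{\tau-1}^k, o_\tau^k, z_\tau^k = i)\,\pi(z_\tau^k = i, a_\tau^k), & \tau > 0
\end{cases} \tag{37}
$$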
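The recursions (35) to (37) can be sketched with NumPy as follows. The array layout is an assumption made for this illustration only: `mu[i]`, `pi[i, a]`, and `W[i, a, o, j]` hold the RPR parameters, `actions` holds $a_0,\cdots,a_T$, and `observations[tau - 1]` holds $o_\tau$. The function names and the random example are hypothetical.

```python
import numpy as np

def forward_pass(mu, pi, W, actions, observations):
    """Scaled forward recursion (35); the normalizers are the step probabilities (37)."""
    num_steps = len(actions)
    alpha = np.zeros((num_steps, len(mu)))
    step_probs = np.zeros(num_steps)          # step_probs[tau] = p(a_tau | h_tau, Theta)
    unnorm = mu * pi[:, actions[0]]
    for tau in range(num_steps):
        if tau > 0:
            a_prev, o = actions[tau - 1], observations[tau - 1]
            # sum_j alpha_{tau-1}(j) W(j, a_{tau-1}, o_tau, i) pi(i, a_tau)
            unnorm = (alpha[tau - 1] @ W[:, a_prev, o, :]) * pi[:, actions[tau]]
        step_probs[tau] = unnorm.sum()
        alpha[tau] = unnorm / step_probs[tau]
    return alpha, step_probs

def backward_pass(pi, W, actions, observations, step_probs, t):
    """Scaled backward recursion (36) for the sub-episode ending at time t."""
    beta = np.zeros((t + 1, pi.shape[0]))
    beta[t] = 1.0 / step_probs[t]
    for tau in range(t - 1, -1, -1):
        a, o, a_next = actions[tau], observations[tau], actions[tau + 1]
        # sum_j W(i, a_tau, o_{tau+1}, j) pi(j, a_{tau+1}) beta_{t,tau+1}(j)
        beta[tau] = W[:, a, o, :] @ (pi[:, a_next] * beta[tau + 1]) / step_probs[tau]
    return beta

# Hypothetical random RPR with 2 belief regions, 3 actions, 2 observations
rng = np.random.default_rng(1)
Z, A, O = 2, 3, 2
mu = rng.dirichlet(np.ones(Z))
pi = rng.dirichlet(np.ones(A), size=Z)
W = rng.dirichlet(np.ones(Z), size=(Z, A, O))
actions, observations = [0, 2, 1], [1, 0]

alpha, step_probs = forward_pass(mu, pi, W, actions, observations)
t = 2
beta = backward_pass(pi, W, actions, observations, step_probs, t)
phi = alpha[: t + 1] * beta * step_probs[: t + 1, None]   # marginals as in (34)
```

The forward pass runs once per episode, while `backward_pass` has to be called once for every sub-episode ending time `t` whose reward matters, which is the source of the cost difference discussed for the $\alpha$ and $\beta$ variables.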


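4.2 Updating the Parameters

We rewrite the lower bound in (22),

$$
\begin{aligned}
\mathrm{LB}(\widehat{\Theta}|\Theta^{(n)}) &= \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\cdots,z_t^k=1}^{|\mathcal{Z}|} q_t^k(z_{0:t}^k|\Theta^{(n)})\ln\frac{\tilde{r}_t^k\, p(a_{0:t}^k, z_{0:t}^k|o_{1:t}^k,\widehat{\Theta})}{q_t^k(z_{0:t}^k|\Theta^{(n)})} \\
&= \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\cdots,z_t^k=1}^{|\mathcal{Z}|} q_t^k(z_{0:t}^k|\Theta^{(n)})\ln p(a_{0:t}^k, z_{0:t}^k|o_{1:t}^k,\widehat{\Theta}) + \text{constant}
\end{aligned} \tag{38}
$$

where the “constant” collects all the terms irrelevant to $\widehat{\Theta}$. Substituting the RPR factorization of $p(a_{0:t}^k, z_{0:t}^k|o_{1:t}^k,\widehat{\Theta})$ and the weighted posterior $q_t^k$ gives

$$
\mathrm{LB}(\widehat{\Theta}|\Theta) = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\frac{\sigma_t^k}{\widehat{V}(\mathcal{D}^{(K)};\Theta)}\Bigg\{\sum_{i=1}^{|\mathcal{Z}|}\phi_{t,0}^k(i)\ln\hat{\mu}(i) + \sum_{\tau=0}^{t}\sum_{i=1}^{|\mathcal{Z}|}\phi_{t,\tau}^k(i)\ln\hat{\pi}(i,a_\tau^k) + \sum_{\tau=1}^{t}\sum_{i,j=1}^{|\mathcal{Z}|}\xi_{t,\tau-1}^k(i,j)\ln\widehat{W}(i,a_{\tau-1}^k,o_\tau^k,j)\Bigg\} + \text{constant} \tag{39}
$$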

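It is not difficult to show that $\widehat{\Theta} = \arg\max_{\widehat{\Theta}\in\mathcal{F}}\mathrm{LB}(\widehat{\Theta}|\Theta)$ is given by

$$
\hat{\mu}(i) = \frac{\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sigma_t^k\,\phi_{t,0}^k(i)}{\sum_{i=1}^{|\mathcal{Z}|}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sigma_t^k\,\phi_{t,0}^k(i)} \tag{40}
$$

$$
\hat{\pi}(i,a) = \frac{\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sigma_t^k\sum_{\tau=0}^{t}\phi_{t,\tau}^k(i)\,\delta(a_\tau^k,a)}{\sum_{a=1}^{|\mathcal{A}|}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sigma_t^k\sum_{\tau=0}^{t}\phi_{t,\tau}^k(i)\,\delta(a_\tau^k,a)} \tag{41}
$$

$$
\widehat{W}(i,a,o,j) = \frac{\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sigma_t^k\sum_{\tau=0}^{t-1}\xi_{t,\tau}^k(i,j)\,\delta(a_\tau^k,a)\,\delta(o_{\tau+1}^k,o)}{\sum_{j=1}^{|\mathcal{Z}|}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sigma_t^k\sum_{\tau=0}^{t-1}\xi_{t,\tau}^k(i,j)\,\delta(a_\tau^k,a)\,\delta(o_{\tau+1}^k,o)} \tag{42}
$$

for $i, j = 1, 2, \cdots, |\mathcal{Z}|$, $a = 1, \cdots, |\mathcal{A}|$, and $o = 1, \cdots, |\mathcal{O}|$, where $\delta(a,b) = 1$ if $a = b$ and $0$ otherwise, and $\sigma_t^k$ is the recomputed reward as defined in (19). In computing $\sigma_t^k$ one employs the equation $p(a_{0:t}^k|o_{1:t}^k,\Theta) = \prod_{\tau=0}^{t} p(a_\tau^k|h_\tau^k,\Theta)$, established earlier, to get

$$
\sigma_t^k(\Theta) = \tilde{r}_t^k\prod_{\tau=0}^{t} p(a_\tau^k|h_\tau^k,\Theta) \tag{43}
$$

with $p(a_\tau^k|h_\tau^k,\Theta)$ computed from the $\alpha$ variables by using (37). Note that the normalization constant, which is equal to the empirical value $\widehat{V}(\mathcal{D}^{(K)};\Theta)$, is now canceled in the update formulae of $\widehat{\Theta}$.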
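Reward re-computation (43) and the empirical value (20) reduce to a cumulative product and an average. The sketch below assumes the per-step probabilities $p(a_\tau^k|h_\tau^k,\Theta)$ have already been obtained from the forward recursion; the function names and the two example episodes are hypothetical.

```python
import numpy as np

def recomputed_rewards(r_tilde, step_probs):
    """sigma_t = r~_t * prod_{tau<=t} p(a_tau | h_tau, Theta), as in (43)."""
    return np.asarray(r_tilde, dtype=float) * np.cumprod(step_probs)

def empirical_value(sigma_per_episode):
    """(1/K) sum_k sum_t sigma_t^k, as in (20); one array of sigmas per episode."""
    return sum(float(np.sum(s)) for s in sigma_per_episode) / len(sigma_per_episode)

# Two hypothetical episodes of different lengths
sigma_1 = recomputed_rewards([2.0, 3.6, 12.96], [0.5, 0.4, 0.5])
sigma_2 = recomputed_rewards([2.0], [0.25])
value = empirical_value([sigma_1, sigma_2])
```

Since every update formula is a ratio of sums weighted by the same recomputed rewards, dividing all of them by `value` would not change the result, which is why the normalization constant cancels.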


4.3 The Complete Value Maximization Algorithm for Single-Task RPR Learning

4.3.1 Algorithmic Description

The complete value maximization algorithm for single-task RPR learning is summarized in Table 1. In earlier discussions regarding the relations of the algorithm to EM, we have mentioned that reward computation constitutes an outer EM-like loop; the standard EM employed to solve (26) is embedded in the outer loop and constitutes an inner EM loop. The double EM loops are not explicitly shown in Table 1. However, one may separate these two loops by keeping $\{\sigma_t^k\}$ fixed when updating $\Theta$ and the posterior of the $z$’s, until the empirical value converges; see (28) for details. Once $\{\sigma_t^k\}$ are updated, the empirical value will further increase by continuing to update $\Theta$ and the posterior of the $z$’s. Note that the $\{\sigma_t^k\}$ used in the convergence check are always updated at each iteration, even though the new $\{\sigma_t^k\}$ may not be used for updating $\Theta$ and the posterior of the $z$’s.


Table 1: The value maximization algorithm for single-task RPR learning

Input: $\mathcal{D}^{(K)}$, $\mathcal{A}$, $\mathcal{O}$, $|\mathcal{Z}|$.
Output: $\Theta = \{\mu, \pi, W\}$.

1. Initialize $\Theta$, $\ell$ = [ ], iteration = 1.
2. Repeat
   2.1 Dynamic programming:
       Compute the $\alpha$ and $\beta$ variables with equations (35), (36), and (37).
   2.2 Reward re-computation:
       Calculate $\{\sigma_t^k\}$ using (43) and (37).
   2.3 Convergence check:
       Compute $\ell$(iteration) $= \widehat{V}(\mathcal{D}^{(K)};\Theta)$ using (20).
       If the sequence of $\ell$ converges
           Stop the algorithm and exit.
       Else
           iteration := iteration + 1
   2.4 Posterior update for $z$:
       Compute the $\xi$ and $\phi$ variables using equations (33) and (34).
   2.5 Update of $\Theta$:
       Compute the updated $\Theta$ using (40), (41), and (42).


Given a history of actions and observations $(a_{0:t-1}, o_{1:t})$ collected up to time step $t$, the single RPR yields a distribution of $a_t$, as given by the RPR definition. The optimal choice for $a_t$ can be obtained by either sampling from this distribution or taking the action that maximizes the probability.
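Action selection can be sketched by filtering a belief over the latent regions and mixing the region-wise action distributions. This is an illustration under assumptions, not the report's own procedure: it assumes the RPR joint factorizes as $\mu(z_0)\pi(z_0,a_0)\prod_\tau W(z_{\tau-1},a_{\tau-1},o_\tau,z_\tau)\pi(z_\tau,a_\tau)$, and it reuses a hypothetical array layout `mu[i]`, `pi[i, a]`, `W[i, a, o, j]`.

```python
import numpy as np

def action_distribution(mu, pi, W, past_actions, observations):
    """Return p(a_t | a_{0:t-1}, o_{1:t}, Theta) as a vector over actions.

    past_actions : a_0, ..., a_{t-1}
    observations : o_1, ..., o_t (same length as past_actions)
    """
    belief = np.array(mu, dtype=float)        # distribution of z_0
    for a, o in zip(past_actions, observations):
        # condition on the action taken, then move to the next region
        belief = (belief * pi[:, a]) @ W[:, a, o, :]
        belief /= belief.sum()
    return belief @ pi                        # mix region-wise action distributions

# Hypothetical random RPR with 3 regions, 2 actions, 2 observations
rng = np.random.default_rng(0)
Z, A, O = 3, 2, 2
mu = rng.dirichlet(np.ones(Z))
pi = rng.dirichlet(np.ones(A), size=Z)
W = rng.dirichlet(np.ones(Z), size=(Z, A, O))

dist = action_distribution(mu, pi, W, past_actions=[1], observations=[0])
greedy_action = int(np.argmax(dist))          # take the most probable action
sampled_action = int(rng.choice(A, p=dist))   # or sample from the distribution
```

Both selection rules named in the text are shown: `greedy_action` maximizes the probability, while `sampled_action` draws from the distribution.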

4.3.2 Time Complexity Analysis

We quantify the time complexity by the number of real number multiplications performed per iteration and present it in the Big-O notation. Since there is no compelling reason for the number of iterations to depend on the size of the input (see footnote 2), the complexity per iteration also represents the complexity of the complete algorithm. A stepwise analysis of the time complexity of the value maximization algorithm in Table 1 is given as follows.

• Computation of the $\alpha$ variables with (35) and (37) runs in time $O(|\mathcal{Z}|^2\sum_{k=1}^{K} T_k)$.

• Computation of the $\beta$’s with (36) and (37) runs in time $O(|\mathcal{Z}|^2\sum_{k=1}^{K}\sum_{t=0,\,r_t^k\neq 0}^{T_k}(t+1))$, which depends on the degree of sparsity of the immediate rewards $\{r_0^k r_1^k\cdots r_{T_k}^k\}_{k=1}^{K}$. In the worst case the time is $O(|\mathcal{Z}|^2\sum_{k=1}^{K}\sum_{t=0}^{T_k}(t+1)) = O(|\mathcal{Z}|^2\sum_{k=1}^{K} T_k^2)$, which occurs when the immediate reward in each episode is nonzero at every time step. In the best case the time is $O(|\mathcal{Z}|^2\sum_{k=1}^{K} T_k)$, which occurs when the immediate reward in each episode is nonzero only at a fixed number of time steps (only at the last time step, for example, as is the case of the benchmark problems presented later in the paper).

• The reward re-computation using (43) and (37) requires time $O(\sum_{k=1}^{K} T_k)$ in the worst case and $O(K)$ in the best case, where the worst/best cases are as defined above.

• Update of $\Theta$ using (40), (41), and (42), as well as computation of the $\xi$ and $\phi$ variables using (33) and (34), runs in time $O(|\mathcal{Z}|^2\sum_{k=1}^{K} T_k^2)$ in the worst case and $O(|\mathcal{Z}|^2\sum_{k=1}^{K} T_k)$ in the best case, where the worst/best cases are as defined above.

2. The number of iterations usually depends on such factors as initialization of the algorithm and the required accuracy, etc.

Since $\sum_{k=1}^{K} T_k \gg |\mathcal{A}||\mathcal{O}|$ in general, the overall complexity of the value maximization algorithm is $O(|\mathcal{Z}|^2\sum_{k=1}^{K} T_k^2)$ in the worst case and $O(|\mathcal{Z}|^2\sum_{k=1}^{K} T_k)$ in the best case, depending on the degree of sparsity of the immediate rewards. Therefore the algorithm scales linearly with the number of episodes and quadratically with the number of belief regions. The time dependency on the lengths of episodes is between linear and quadratic. The sparser the immediate rewards are, the closer the time is to being linear in the lengths of episodes.

Note that in many reinforcement learning problems, the agent does not receive immediate rewards at every time step. For the benchmark problems and maze navigation problems considered later in the paper, the agent receives rewards only when the goal state is reached, which makes the value maximization algorithm scale linearly with the lengths of episodes.
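The effect of reward sparsity on the cost of the $\beta$ recursion can be illustrated with a simple count. The function below only evaluates the leading term $|\mathcal{Z}|^2\sum_k\sum_{t:\,r_t^k\neq 0}(t+1)$ from the analysis above and ignores constant factors; its name and the two example reward sequences are hypothetical.

```python
def beta_multiplications(reward_sequences, num_regions):
    """Order-of-magnitude multiplication count for the beta variables.

    A backward pass of length t + 1 is needed only for sub-episodes
    whose terminal reward r_t is nonzero.
    """
    total_steps = 0
    for rewards in reward_sequences:
        total_steps += sum(t + 1 for t, r in enumerate(rewards) if r != 0)
    return num_regions ** 2 * total_steps

dense = [[1.0, 1.0, 1.0, 1.0]]    # reward at every step: quadratic in episode length
sparse = [[0.0, 0.0, 0.0, 1.0]]   # reward only at the goal: linear in episode length

dense_cost = beta_multiplications(dense, num_regions=3)
sparse_cost = beta_multiplications(sparse, num_regions=3)
```

For these four-step episodes the dense case counts 1 + 2 + 3 + 4 backward steps, while the goal-only case counts a single pass of 4 steps.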


5. Multi-Task Reinforcement Learning (MTRL) with RPR

We formulate our MTRL framework by placing multiple RPRs in a Bayesian setting and develop techniques to learn the posterior of each RPR within the context of all other RPRs.

Several notational conventions are observed in this section. The posterior of $\Theta$ is expressed in terms of probability density functions. The notation $G_0(\Theta)$ is reserved to denote the density function of a parametric prior distribution, with the associated probability measure denoted by $G_0$ without a parenthesized $\Theta$ beside it. For the Dirichlet process (which is a nonparametric prior), $G_0$ denotes the base measure and $G_0(\Theta)$ denotes the corresponding density function. The twofold use of $G_0$ is for notational simplicity; the difference can be easily discerned by the presence or absence of a parenthesized $\Theta$. The $\delta$ is a Dirac delta for continuous arguments and a Kronecker delta for discrete arguments. The notation $\delta_{\Theta_j}$ is the Dirac measure satisfying $\delta_{\Theta_j}(d\Theta_m) = 1$ if $\Theta_j \in d\Theta_m$, and $0$ otherwise.


5.1 Basic Bayesian Formulation of RPR

Consider $M$ partially observable and stochastic environments indexed by $m = 1, 2, \cdots, M$, each of which is apparently different from the others but may actually share fundamental common characteristics with some other environments. Assume we have a set of episodes collected from each environment,

$$
\mathcal{D}_m^{(K_m)} = \Big\{\big(a_0^{m,k} r_0^{m,k} o_1^{m,k} a_1^{m,k} r_1^{m,k}\cdots o_{T_{m,k}}^{m,k} a_{T_{m,k}}^{m,k} r_{T_{m,k}}^{m,k}\big)\Big\}_{k=1}^{K_m}
$$

for $m = 1, 2, \cdots, M$, where $T_{m,k}$ represents the length of episode $k$ in environment $m$. Following the definitions in Section 3, we write the empirical value function of the $m$-th environment as

$$
\widehat{V}(\mathcal{D}_m^{(K_m)};\Theta_m) = \frac{1}{K_m}\sum_{k=1}^{K_m}\sum_{t=0}^{T_{m,k}}\tilde{r}_t^{m,k}\, p(a_{0:t}^{m,k}|o_{1:t}^{m,k},\Theta_m) \tag{44}
$$

for $m = 1, 2, \cdots, M$, where $\Theta_m = \{\mu_m, \pi_m, W_m\}$ are the RPR parameters for the $m$-th individual environment.

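Let $G_0(\Theta_m)$ represent the prior of $\Theta_m$, where $G_0(\Theta)$ is assumed to be the density function of a probability distribution. We define the posterior of $\Theta_m$ as

$$
p(\Theta_m|\mathcal{D}_m^{(K_m)}, G_0) \overset{\text{def}}{=} \frac{\widehat{V}(\mathcal{D}_m^{(K_m)};\Theta_m)\, G_0(\Theta_m)}{\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})} \tag{45}
$$

where the inclusion of $G_0$ in the left hand side is to explicitly indicate that the prior being used is $G_0$, and $\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ is a normalization constant

$$
\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)}) \overset{\text{def}}{=} \int \widehat{V}(\mathcal{D}_m^{(K_m)};\Theta_m)\, G_0(\Theta_m)\, d\Theta_m \tag{46}
$$

which is also referred to as the marginal empirical value (see footnote 3), since the parameters $\Theta_m$ are integrated out (marginalized). The marginal empirical value $\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ represents the accumulated discounted reward in the episodes, averaged over infinite RPR policies independently drawn from $G_0$.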

Equation (45) is literally a normalized product of the empirical value function and a prior $G_0(\Theta_m)$. Since $\int p(\Theta_m|\mathcal{D}_m^{(K_m)}, G_0)\, d\Theta_m = 1$, (45) yields a valid probability density, which we call the posterior of $\Theta_m$ given the episodes $\mathcal{D}_m^{(K_m)}$. It is noted that (45) would be the Bayes rule if $\widehat{V}(\mathcal{D}_m^{(K_m)};\Theta_m)$ were a likelihood function. Since $\widehat{V}(\mathcal{D}_m^{(K_m)};\Theta_m)$ is a value function in our case, (45) is a somewhat non-standard use of Bayes rule. However, like the classic Bayes rule, (45) indeed gives a posterior whose shape incorporates both the prior information about $\Theta_m$ and the empirical information from the episodes.

Equation (45) has another interpretation that may be more meaningful from the perspective of standard probability theory. To see this we substitute (44) into (45) to obtain

$$
p(\Theta_m|\mathcal{D}_m^{(K_m)}, G_0) = \frac{\frac{1}{K_m}\sum_{k=1}^{K_m}\sum_{t=0}^{T_{m,k}}\tilde{r}_t^{m,k}\, p(a_{0:t}^{m,k}|o_{1:t}^{m,k},\Theta_m)\, G_0(\Theta_m)}{\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})} \tag{47}
$$

$$
= \frac{\frac{1}{K_m}\sum_{k=1}^{K_m}\sum_{t=0}^{T_{m,k}}\nu_t^{m,k}\, p(\Theta_m|a_{0:t}^{m,k}, o_{1:t}^{m,k}, G_0)}{\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})} \tag{48}
$$

where

$$
\nu_t^{m,k} = \tilde{r}_t^{m,k}\, p(a_{0:t}^{m,k}|o_{1:t}^{m,k}, G_0) = \tilde{r}_t^{m,k}\int p(a_{0:t}^{m,k}|o_{1:t}^{m,k},\Theta_m)\, G_0(\Theta_m)\, d\Theta_m = \int\sigma_t^{m,k}(\Theta_m)\, G_0(\Theta_m)\, d\Theta_m \tag{49}
$$

with $\sigma_t^{m,k}$ the re-computed reward as defined in (19); therefore $\nu_t^{m,k}$ is the averaged recomputed reward, obtained by taking the expectation of $\sigma_t^{m,k}(\Theta_m)$ with respect to $G_0(\Theta_m)$.

3. The term “marginal” is borrowed from probability theory. Here we use it to indicate that the dependence of the value on the parameter is removed by integrating out the parameter.

In arriving at (48), we have used the fact that the RPR parameters are independent of the observations, which is true for the following reason: the RPR is a policy concerning generation of the actions, employing as input the observations (which themselves are generated by the unknown environment); therefore, observations carry no information about the RPR parameters, i.e., $p(\Theta|\text{observations}) = p(\Theta) = G_0(\Theta)$.

It is noted that $p(\Theta_m|a_{0:t}^{m,k}, o_{1:t}^{m,k}, G_0)$ in (48) is the standard posterior of $\Theta_m$ given the action sequence $a_{0:t}^{m,k}$, and $p(\Theta_m|\mathcal{D}_m^{(K_m)}, G_0)$ is a mixture of these posteriors with the mixing proportions given by the normalized $\nu_t^{m,k}$. The meaning of (48) is fairly intuitive: each action sequence affects the posterior of $\Theta_m$ in proportion to its re-evaluated reward. This is distinct from the posterior in the classic hidden Markov model (Rabiner, 1989), where sequences are treated as equally important.

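Since $p(\Theta_m|\mathcal{D}_m^{(K_m)}, G_0)$ integrates to one, the normalization constant $\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ is

$$
\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)}) = \frac{1}{K_m}\sum_{k=1}^{K_m}\sum_{t=0}^{T_{m,k}}\nu_t^{m,k} \tag{50}
$$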


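We obtain a more convenient form of the posterior by substituting the RPR expression for $p(a_{0:t}^{m,k}|o_{1:t}^{m,k},\Theta_m)$ into (47) to expand the summation over the latent $z$ variables, yielding

$$
p(\Theta_m|\mathcal{D}_m^{(K_m)}, G_0) = \frac{\frac{1}{K_m}\sum_{k=1}^{K_m}\sum_{t=0}^{T_{m,k}}\tilde{r}_t^{m,k}\sum_{z_0^{m,k},\cdots,z_t^{m,k}=1}^{|\mathcal{Z}|} p(a_{0:t}^{m,k}, z_{0:t}^{m,k}|o_{1:t}^{m,k},\Theta_m)\, G_0(\Theta_m)}{\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})} \tag{51}
$$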


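To obtain an analytic posterior, we let the prior be conjugate to $p(a_{0:t}^{m,k}, z_{0:t}^{m,k}|o_{1:t}^{m,k},\Theta_m)$. As shown by the RPR definition, $p(a_{0:t}^{m,k}, z_{0:t}^{m,k}|o_{1:t}^{m,k},\Theta_m)$ is a product of multinomial distributions, and hence we choose the prior as a product of Dirichlet distributions, with each Dirichlet representing an independent prior for a subset of parameters in $\Theta$. The density function of such a prior is given by

$$
G_0(\Theta_m) = p(\mu_m|\upsilon)\, p(\pi_m|\rho)\, p(W_m|\omega) \tag{52}
$$

$$
p(\mu_m|\upsilon) = \mathrm{Dir}\big(\mu_m(1),\cdots,\mu_m(|\mathcal{Z}|)\,\big|\,\upsilon\big) \tag{53}
$$

$$
p(\pi_m|\rho) = \prod_{i=1}^{|\mathcal{Z}|}\mathrm{Dir}\big(\pi_m(i,1),\cdots,\pi_m(i,|\mathcal{A}|)\,\big|\,\rho_i\big) \tag{54}
$$

$$
p(W_m|\omega) = \prod_{a=1}^{|\mathcal{A}|}\prod_{o=1}^{|\mathcal{O}|}\prod_{i=1}^{|\mathcal{Z}|}\mathrm{Dir}\big(W_m(i,a,o,1),\cdots,W_m(i,a,o,|\mathcal{Z}|)\,\big|\,\omega_{i,a,o}\big) \tag{55}
$$

where $\upsilon = \{\upsilon_1,\ldots,\upsilon_{|\mathcal{Z}|}\}$, $\rho = \{\rho_1,\ldots,\rho_{|\mathcal{Z}|}\}$ with $\rho_i = \{\rho_{i,1},\ldots,\rho_{i,|\mathcal{A}|}\}$, and $\omega = \{\omega_{i,a,o} : i = 1\ldots|\mathcal{Z}|, a = 1\ldots|\mathcal{A}|, o = 1\ldots|\mathcal{O}|\}$ with $\omega_{i,a,o} = \{\omega_{i,a,o,1},\ldots,\omega_{i,a,o,|\mathcal{Z}|}\}$. Substituting the expression of $G_0$ into (51), one gets

$$
p(\Theta_m|\mathcal{D}_m^{(K_m)}, G_0) = \frac{\frac{1}{K_m}\sum_{k=1}^{K_m}\sum_{t=0}^{T_{m,k}}\sum_{z_0^{m,k},\cdots,z_t^{m,k}=1}^{|\mathcal{Z}|}\nu_t^{m,k}(z_{0:t}^{m,k})\, p(\Theta_m|a_{0:t}^{m,k}, o_{1:t}^{m,k}, z_{0:t}^{m,k}, G_0)}{\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})} \tag{56}
$$

where

$$
\begin{aligned}
\nu_t^{m,k}(z_{0:t}^{m,k}) &= \tilde{r}_t^{m,k}\int p(a_{0:t}^{m,k}, z_{0:t}^{m,k}|o_{1:t}^{m,k},\Theta_m)\, G_0(\Theta_m)\, d\Theta_m \\
&= \tilde{r}_t^{m,k}\,\frac{\prod_i\Gamma(\hat{\upsilon}_i^{m,k,t})}{\Gamma(\sum_i\hat{\upsilon}_i^{m,k,t})}\,\frac{\Gamma(\sum_i\upsilon_i)}{\prod_i\Gamma(\upsilon_i)}\,\prod_i\left[\frac{\prod_a\Gamma(\hat{\rho}_{i,a}^{m,k,t})}{\Gamma(\sum_a\hat{\rho}_{i,a}^{m,k,t})}\,\frac{\Gamma(\sum_a\rho_{i,a})}{\prod_a\Gamma(\rho_{i,a})}\right] \\
&\quad\times\prod_a\prod_o\prod_i\left[\frac{\prod_j\Gamma(\hat{\omega}_{i,a,o,j}^{m,k,t})}{\Gamma(\sum_j\hat{\omega}_{i,a,o,j}^{m,k,t})}\,\frac{\Gamma(\sum_j\omega_{i,a,o,j})}{\prod_j\Gamma(\omega_{i,a,o,j})}\right]
\end{aligned} \tag{57}
$$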

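represents the averaged recomputed reward over a given $z$ sequence $z_{0:t}^{m,k}$, and

$$
p(\Theta_m|a_{0:t}^{m,k}, o_{1:t}^{m,k}, z_{0:t}^{m,k}, G_0) = p(\mu_m|\hat{\upsilon}^{m,k,t})\, p(\pi_m|\hat{\rho}^{m,k,t})\, p(W_m|\hat{\omega}^{m,k,t}) \tag{58}
$$

is the density of a product of Dirichlet distributions and has the same form as $G_0(\Theta)$ in (52) but with $\upsilon$, $\rho$, $\omega$ respectively replaced by $\hat{\upsilon}^{m,k,t}$, $\hat{\rho}^{m,k,t}$, $\hat{\omega}^{m,k,t}$, as given by

$$
\hat{\upsilon}_i^{m,k,t} = \upsilon_i + \delta(z_0^{m,k} - i) \tag{59}
$$

$$
\hat{\rho}_{i,a}^{m,k,t} = \rho_{i,a} + \sum_{\tau=0}^{t}\delta(z_\tau^{m,k} - i)\,\delta(a_\tau^{m,k} - a) \tag{60}
$$

$$
\hat{\omega}_{i,a,o,j}^{m,k,t} = \omega_{i,a,o,j} + \sum_{\tau=1}^{t}\delta(z_{\tau-1}^{m,k} - i)\,\delta(a_{\tau-1}^{m,k} - a)\,\delta(o_\tau^{m,k} - o)\,\delta(z_\tau^{m,k} - j) \tag{61}
$$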
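The count updates (59) to (61) and the resulting closed-form integral over the Dirichlet prior can be sketched as follows. This is an illustration under assumptions: the hyper-parameters are stored as float arrays `v[i]`, `rho[i, a]`, `omega[i, a, o, j]`, indices start at zero, and the function names are hypothetical. The marginal is computed as the ratio of Dirichlet normalizers after and before the update, which is the standard Dirichlet-multinomial identity.

```python
import math
import numpy as np

def posterior_counts(v, rho, omega, z, actions, observations):
    """Add the counts of one latent path z_{0:t} to the prior, as in (59)-(61).

    observations[tau - 1] holds o_tau, so it has one entry fewer than z and actions.
    """
    v_hat, rho_hat, omega_hat = v.copy(), rho.copy(), omega.copy()
    v_hat[z[0]] += 1
    for tau in range(len(z)):
        rho_hat[z[tau], actions[tau]] += 1
        if tau > 0:
            omega_hat[z[tau - 1], actions[tau - 1], observations[tau - 1], z[tau]] += 1
    return v_hat, rho_hat, omega_hat

def log_dirichlet_norm(counts):
    """log of prod_i Gamma(c_i) / Gamma(sum_i c_i), taken over the last axis."""
    c = np.asarray(counts, dtype=float)
    lgamma = np.vectorize(math.lgamma)
    return lgamma(c).sum(axis=-1) - lgamma(c.sum(axis=-1))

def log_path_marginal(v, rho, omega, z, actions, observations):
    """log of the integral of p(a_{0:t}, z_{0:t} | o_{1:t}, Theta) G0(Theta) dTheta."""
    post = posterior_counts(v, rho, omega, z, actions, observations)
    return sum(np.sum(log_dirichlet_norm(p) - log_dirichlet_norm(q))
               for p, q in zip(post, (v, rho, omega)))

# Hypothetical uniform prior: 2 regions, 2 actions, 1 observation
Z, A, O = 2, 2, 1
v, rho, omega = np.ones(Z), np.ones((Z, A)), np.ones((Z, A, O, Z))
log_m = log_path_marginal(v, rho, omega, z=[0, 0], actions=[1, 1], observations=[0])
```

Multiplying `exp(log_m)` by the weighted reward $\tilde{r}_t^{m,k}$ gives the averaged recomputed reward of that latent path; working in log space avoids overflow of the Gamma functions for long episodes.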

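The normalization constant $\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ (which is also the marginal empirical value) can now be expressed as

$$
\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)}) = \frac{1}{K_m}\sum_{k=1}^{K_m}\sum_{t=0}^{T_{m,k}}\sum_{z_0^{m,k},\cdots,z_t^{m,k}=1}^{|\mathcal{Z}|}\nu_t^{m,k}(z_{0:t}^{m,k}) \tag{62}
$$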

5.2  The  Dirichlet  Process  Prior 

In order to identify related tasks and introduce sharing mechanisms for multi-task learning,
we employ the Dirichlet process (Ferguson, 1973; Blackwell and MacQueen, 1973; Antoniak,
1974; Sethuraman, 1994) as a nonparametric prior that is shared by $\Theta_m$, $m = 1, 2, \dots, M$.
A draw from a DP has the nice property of being almost surely discrete (Blackwell and
MacQueen, 1973), which is known to promote clustering (West et al., 1994); therefore, related
tasks (as judged by the empirical value function) are encouraged to be placed in the same
group and be learned simultaneously by sharing the episodic data across all tasks in the
same group. Assuming the prior of $\Theta_m$, $m = 1, 2, \dots, M$, is drawn from a Dirichlet process
with base measure $G_0$ and precision $\alpha$, we have

$$\Theta_m \mid G \sim G, \qquad G \mid \alpha, G_0 \sim DP(\alpha, G_0) \qquad (63)$$

where the precision $\alpha$ provides an expected number of dominant clusters, with this driven
by the number of samples (West, 1992). It usually suffices to set the precision $\alpha$ using the
rule in (West, 1992). If desired, however, one may also put a Gamma prior on $\alpha$ and draw
from its posterior (Escobar and West, 1995), which yields greater model flexibility. Note
the DP precision is denoted by the same symbol as the $\alpha$ variables in (EH). The difference
is easy to recognize, since the former is a single quantity bearing neither superscripts
nor subscripts while the latter represent a set of variables and always bear superscripts and
subscripts.
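
To give a feel for how the precision $\alpha$ and the number of environments $M$ jointly drive the number of clusters, the sketch below evaluates the prior expected number of distinct values among $M$ draws, $\sum_{i=1}^{M} \alpha/(\alpha+i-1)$. This is a standard property of the Dirichlet process rather than a formula stated in this section, and the function name is an assumption of the example.

```python
def expected_num_clusters(alpha, M):
    """Prior expected number of distinct values among M draws from G ~ DP(alpha, G0)."""
    # The i-th draw (0-based) opens a new cluster with probability alpha / (alpha + i).
    return sum(alpha / (alpha + i) for i in range(M))
```

For example, `expected_num_clusters(1.0, 3)` evaluates `1 + 1/2 + 1/3`; the expected count grows roughly logarithmically in $M$ for fixed $\alpha$.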

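By marginalizing out $G$, one obtains the Polya-urn representation of DP (Blackwell and
MacQueen, 1973), expressed in terms of density functions,⁴

$$p(\Theta_m \mid \Theta_{-m}, \alpha, G_0) = \frac{\alpha}{\alpha + M - 1}\, G_0(\Theta_m) + \frac{1}{\alpha + M - 1}\sum_{j=1,\, j \neq m}^{M} \delta(\Theta_m - \Theta_j), \qquad m = 1, \dots, M \qquad (64)$$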

where the probability is conditioned on $\Theta_{-m} = \{\Theta_1, \Theta_2, \dots, \Theta_M\} \setminus \{\Theta_m\}$. The Polya-urn
representation in (64) gives a set of full conditionals for the joint prior $p(\Theta_1, \Theta_2, \dots, \Theta_M)$.

The fact that $G \sim DP(\alpha, G_0)$ is almost surely discrete implies that the set $\{\Theta_1, \Theta_2, \dots, \Theta_M\}$,
which are iid drawn from $G$, can have duplicate elements and the number of
distinct elements $N$ cannot exceed $M$, the total number of environments. It is useful to
consider an equivalent representation of (64) based on the distinct elements (Neal, 1998).
Let $\bar\Theta = \{\bar\Theta_1, \bar\Theta_2, \dots, \bar\Theta_N\}$ represent the set of distinct elements of $\{\Theta_1, \Theta_2, \dots, \Theta_M\}$,
with $N \le M$. Let $c = \{c_1, c_2, \dots, c_M\}$ denote the vector of indicator variables defined by
$c_m = n$ iff $\Theta_m = \bar\Theta_n$, and $c_{-m} = \{c_1, c_2, \dots, c_M\} \setminus \{c_m\}$. The prior conditional distribution
$p(c_m \mid c_{-m})$ that arises from the Polya-urn representation of the Dirichlet process is as follows
(MacEachern, 1994)

$$p(c_m \mid c_{-m}, \alpha) = \frac{\alpha}{\alpha + M - 1}\, \delta(c_m) + \sum_{n=1}^{N} \frac{l_{-m,n}}{\alpha + M - 1}\, \delta(c_m - n) \qquad (65)$$

where $l_{-m,n}$ denotes the number of elements in $\{i : c_i = n,\ i \neq m\}$ and $c_m = 0$ indicates a
new sample is drawn from the base $G_0$. Given $c_m$ and $\bar\Theta$, the density of $\Theta_m$ is given by

$$p(\Theta_m \mid c_m, \bar\Theta, G_0) = \delta(c_m)\, G_0(\Theta_m) + \sum_{n=1}^{N} \delta(c_m - n)\, \delta(\Theta_m - \bar\Theta_n) \qquad (66)$$
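
The prior conditional in (65) is simple to compute from the current indicator vector. The following sketch returns it as a dictionary whose key `0` stands for a new cluster; the function name and the dictionary layout are assumptions made for illustration.

```python
def prior_conditional(c, m, alpha):
    """p(c_m = n | c_{-m}, alpha) as in (65).

    c     : list of current cluster labels (1..N), one per environment.
    m     : 0-based index of the environment whose indicator is being resampled.
    Returns a dict mapping label -> probability, where label 0 means "new cluster".
    """
    M = len(c)
    counts = {}
    for i, ci in enumerate(c):
        if i != m:                      # l_{-m,n}: exclude environment m itself
            counts[ci] = counts.get(ci, 0) + 1
    denom = alpha + M - 1
    probs = {0: alpha / denom}
    for n, l in counts.items():
        probs[n] = l / denom
    return probs
```

Clusters that contain only environment `m` itself do not appear in the result, which matches the convention that such a cluster is empty once `m` is removed.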


4. The corresponding expression in terms of probability measures (Escobar and West, 1995) is given by

$$\Theta_m \mid \Theta_{-m}, \alpha, G_0 \;\sim\; \frac{\alpha}{\alpha + M - 1}\, G_0 + \frac{1}{\alpha + M - 1}\sum_{j=1,\, j \neq m}^{M} \delta_{\Theta_j}, \qquad m = 1, \dots, M,$$

where $\delta_{\Theta_j}$ is the Dirac measure.

5.3 The Dirichlet Process Posterior


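We take two steps to derive the posterior based on the representation of the DP prior given
by (65) and (66). First we write the conditional posterior of $c_m$, $\forall\, m \in \{1, \dots, M\}$,

$$p(c_m \mid c_{-m}, \bar\Theta, \alpha, G_0) = \frac{\int \hat{V}(\mathcal{D}_m^{(K_m)}; \Theta_m)\, p(\Theta_m \mid c_m, \bar\Theta, G_0)\, p(c_m \mid c_{-m}, \alpha)\, d\Theta_m}{\sum_{c_m=0}^{N} \int \hat{V}(\mathcal{D}_m^{(K_m)}; \Theta_m)\, p(\Theta_m \mid c_m, \bar\Theta, G_0)\, p(c_m \mid c_{-m}, \alpha)\, d\Theta_m} \qquad (67)$$

which is rewritten, by substituting (65) and (66) into the righthand side, to yield an
algorithmically more meaningful expression

$$p(c_m \mid c_{-m}, \bar\Theta, \alpha, G_0) = \frac{\alpha\, \hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})\, \delta(c_m) + \sum_{n=1}^{N} l_{-m,n}\, \hat{V}(\mathcal{D}_m^{(K_m)}; \bar\Theta_n)\, \delta(c_m - n)}{\alpha\, \hat{V}_{G_0}(\mathcal{D}_m^{(K_m)}) + \sum_{j=1}^{N} l_{-m,j}\, \hat{V}(\mathcal{D}_m^{(K_m)}; \bar\Theta_j)} \qquad (68)$$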


where $\hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ is the marginal empirical value defined in (50) and its expression is
given by (62) when the DP base has a density function as specified in (52).

It is observed from (68) that the indicator $c_m$ tends to equal $n$ if $\hat{V}(\mathcal{D}_m^{(K_m)}; \bar\Theta_n)$ is
large, which occurs when the $n$-th distinct RPR produces a high empirical value in the
$m$-th environment. If none of the other RPRs produces a high empirical value in the $m$-th
environment, $c_m$ will tend to be equal to zero, which means a new cluster will be generated
to account for the novelty. The merit of generating a new cluster is measured by the
empirical value weighted by $\alpha$ and averaged with respect to $G_0$. Therefore the number of
distinct RPRs is jointly dictated by the DP prior and the episodes.
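
The conditional posterior in (68) can be sketched as a normalized weight vector followed by a categorical draw. In the example below, the empirical values are taken as given inputs; the function names and argument layout are assumptions made for illustration.

```python
import random

def indicator_posterior(alpha, v_marginal, v_cluster, l_counts):
    """Normalized probabilities of (68).

    v_marginal : marginal empirical value of environment m under the base G0.
    v_cluster  : empirical values of environment m under each existing distinct RPR.
    l_counts   : number of other environments currently in each existing cluster.
    Index 0 of the result corresponds to a new cluster, index n to cluster n.
    """
    weights = [alpha * v_marginal] + [l * v for l, v in zip(l_counts, v_cluster)]
    total = sum(weights)
    return [w / total for w in weights]

def draw_indicator(probs, rng=random):
    """Draw c_m from the categorical distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

A cluster whose RPR achieves a high empirical value in environment $m$, or that already contains many environments, receives a proportionally larger probability.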

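Given the indicator variables $c$, the clusters are formed. Let $I_n(c) = \{m : c_m = n\}$
denote the indices of the environments that have been assigned to the $n$-th cluster. Given
the clusters, we now derive the conditional posterior of $\bar\Theta_n$, $\forall\, n \in \{1, \dots, N\}$. If $I_n(c)$ is an
empty set, there is no empirical evidence available for it to obtain a posterior, therefore one
simply removes this cluster. If $I_n(c)$ is nonempty, the density function of the conditional
posterior of $\bar\Theta_n$ is given by

$$p\Big(\bar\Theta_n \,\Big|\, \bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}, G_0\Big) = \frac{\sum_{m \in I_n(c)} \hat{V}(\mathcal{D}_m^{(K_m)}; \bar\Theta_n)\, G_0(\bar\Theta_n)}{\int \sum_{m \in I_n(c)} \hat{V}(\mathcal{D}_m^{(K_m)}; \bar\Theta_n)\, G_0(\bar\Theta_n)\, d\bar\Theta_n} \qquad (69)$$

$$= \frac{\sum_{m \in I_n(c)} \frac{1}{K_m}\sum_{k=1}^{K_m}\sum_{t=0}^{T_{m,k}} \tilde{r}_t^{m,k} \sum_{z_0^{m,k},\dots,z_t^{m,k}=1}^{|Z|} p(a_{0:t}^{m,k}, z_{0:t}^{m,k} \mid o_{1:t}^{m,k}, \bar\Theta_n)\, G_0(\bar\Theta_n)}{\sum_{m \in I_n(c)} \hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})} \qquad (70)$$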


where (70) results from substituting (03) into the righthand side of (69). Note that $\bar\Theta_n$,
which represents the set of parameters of the $n$-th distinct RPR, is conditioned on all
episodes aggregated across all environments in the $n$-th cluster. The posterior in (69) has
the same form as the definition in (S3) and it is obtained by applying Bayes law to the
empirical value function constructed from the aggregated episodes. As before, the Bayes
law is applied in a nonstandard manner, treating the value function as if it were a likelihood
function.

A more concrete expression of (70) can be obtained by letting the DP base $G_0$ have a
density function as in (52),

$$p\Big(\bar\Theta_n \,\Big|\, \bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}, G_0\Big) = \frac{\sum_{m \in I_n(c)} \frac{1}{K_m}\sum_{k=1}^{K_m}\sum_{t=0}^{T_{m,k}} \sum_{z_0^{m,k},\dots,z_t^{m,k}=1}^{|Z|} \nu_t^{m,k}(z_{0:t}^{m,k})\; p(\bar\Theta_n \mid a_{0:t}^{m,k}, o_{1:t}^{m,k}, z_{0:t}^{m,k}, G_0)}{\sum_{m \in I_n(c)} \hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})} \qquad (71)$$

where $\hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ is the marginal empirical value given in (62), $\nu_t^{m,k}(z_{0:t}^{m,k})$ is the average
recomputed reward as given in (57), and

$$p(\bar\Theta_n \mid a_{0:t}^{m,k}, o_{1:t}^{m,k}, z_{0:t}^{m,k}, G_0) = p(\mu^n \mid \hat\upsilon^{m,k,t})\, p(\pi^n \mid \hat\rho^{m,k,t})\, p(W^n \mid \hat\omega^{m,k,t}) \qquad (72)$$

is the density of a product of Dirichlet distributions and has the same form as $G_0(\Theta)$ in
(52) but with $\upsilon$, $\rho$, $\omega$ respectively replaced by $\hat\upsilon^{m,k,t}$, $\hat\rho^{m,k,t}$, $\hat\omega^{m,k,t}$, as given by (59), (60),
and (61).

It  is  noted  that,  conditional  on  the  indicator  variables  c  and  the  episodes  across  all 
environments,  the  distinct  RPRs  are  independent  of  each  other.  The  indicator  variables 
cluster the $M$ environments into $N \le M$ groups, each of which is associated with a distinct
RPR.  Given  the  clusters,  the  environments  in  the  n-th  group  merge  their  episodes  to  form 
a  pool,  and  the  posterior  of  0n  is  derived  based  on  this  pool.  Existing  clusters  may  become 
empty  and  be  removed,  and  new  clusters  may  be  introduced  when  novelty  is  detected,  thus 
the  pools  change  dynamically.  The  dynamic  changes  are  implemented  inside  the  algorithm 
presented  below.  Therefore,  the  number  of  distinct  RPRs  is  not  fixed  but  is  allowed  to  vary. 


5.4  Challenges  for  Gibbs  Sampling 

The DP posterior as given by (68) and (71) may be analyzed using the technique of Gibbs
sampling (Geman and Geman, 1984; Gelfand and Smith, 1990). The Gibbs sampler successively
draws the indicator variables $c_1, c_2, \dots, c_M$ and the distinct RPRs $\bar\Theta_1, \bar\Theta_2, \dots, \bar\Theta_N$
according to (68) and (71). The samples are expected to represent the posterior when the
Markov chain produced by the Gibbs sampler reaches the stationary distribution. However,
the convergence of Gibbs sampling can be slow and a long sequence of samples may be
required before the stationary distribution is reached. The slow convergence can generally
be attributed to the fact that the Gibbs sampler implements message-passing between
dependent variables through the use of samples, instead of sufficient statistics (Jordan et al.,
1999). Variational methods have been suggested as a replacement for Gibbs sampling (Jordan
et al., 1999). Though efficient, variational methods are known to suffer from bias. A
good trade-off is to combine the two, which is the idea of hybrid variational/Gibbs inference
in (Welling et al., 2008).

In our present case, Gibbs sampling is further challenged by the particular form of the
conditional posterior of $\bar\Theta_n$ in (71), which is seen to be a large mixture resulting from the
summation over environment $m$, episode $k$, time step $t$, and latent $z$ variables. Thus it has
a total of $\sum_{m \in I_n(c)} \sum_{k=1}^{K_m} \sum_{t=0}^{T_{m,k}} |Z|^{t+1}$ components and each component is uniquely associated
with a single sub-episode and a specific instantiation of latent $z$ variables. To sample from
this mixture, one first makes a draw to decide a component and then draws $\bar\Theta_n$ from this
component. Obviously, any particular draw of $\bar\Theta_n$ makes use of one single sub-episode only,
instead of simultaneously employing all sub-episodes in the $n$-th cluster as one would wish.
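
The size of this mixture can be checked with a few lines of Python. The sketch below counts one component per sub-episode and per instantiation of $z_0, \dots, z_t$, i.e. $|Z|^{t+1}$ instantiations for a sub-episode ending at time $t$; the function name and the dictionary layout of episode lengths are assumptions of the example.

```python
def num_mixture_components(episode_lengths_by_env, num_states):
    """Number of mixture components in the conditional posterior of one cluster.

    episode_lengths_by_env : {m: [T_{m,1}, ..., T_{m,K_m}]} for the environments
                             assigned to the cluster (T is the last time index).
    num_states             : |Z|, the number of decision states.
    """
    return sum(num_states ** (t + 1)
               for lengths in episode_lengths_by_env.values()
               for T in lengths
               for t in range(T + 1))
```

Even for short episodes the count grows exponentially in the episode length, which is why drawing a single component per Gibbs step uses so little of the available data.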

In essence, mixing with respect to $(m, k, t)$ effectively introduces additional latent
indicator variables, i.e., those for locating environment $m$, episode $k$, and time step $t$. It is
important to note that these new indicator variables play a different role than the $z$'s in
affecting the samples of $\bar\Theta_n$. In particular, the $z$'s are intrinsic latent variables inside the RPR
model, while the new ones are extrinsic latent variables resulting from the particular form
of the empirical value function in (53). Each realization of the new indicators is uniquely
associated with a distinct sub-episode while each realization of the $z$'s is uniquely associated
with specific decision states. Therefore, the update of $\bar\Theta_n$ based on one realization of the new
indicators employs a single sub-episode, but the update based on one realization of the $z$'s
employs all sub-episodes.

5.5 The Gibbs-Variational Algorithm for Learning the DP Posterior

The fact that the Gibbs sampler cannot update the posterior RPR samples by using more
than one sub-episode motivates us to develop a hybrid Gibbs-variational algorithm for
learning the posterior.

We restrict the joint posterior of the latent $z$ variables and the RPR parameters to
the variational Bayesian (VB) approximation that assumes a factorized form. This
restriction yields a variational approximation to $p(\bar\Theta_n \mid \bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}, G_0)$ that is a single
product of Dirichlet density functions, where the terms associated with different episodes
are collected and added up. Therefore, updating of the variational posterior of $\bar\Theta_n$ in each
Gibbs-variational iteration is based on simultaneously employing all sub-episodes in the
$n$-th cluster. In addition, the variational method yields an approximation of the marginal
empirical value $\hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ as given in (50).

The overall Gibbs-variational algorithm is an iterative procedure based on the DP
posterior represented by (68) and (71). At each iteration one successively performs the following
for $m = 1, 2, \dots, M$. First, the cluster indicator variable $c_m$ is drawn according to (68),
where $\hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ is replaced by its variational Bayesian approximation; accordingly the
clusters $I_n = \{m : c_m = n\}$, $n = 1, \dots, N$, are updated. For each nonempty cluster $n$,
the associated distinct RPR is updated by drawing from, or finding the mode of, the
variational Bayesian approximation of $p(\bar\Theta_n \mid \bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}, G_0)$. The steps are iterated until the
variational approximation of $\sum_{n=1}^{N} \hat{V}_{G_0}\big(\bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}\big)$ converges. Note that the number
of clusters is not fixed but changes with the iteration, since existing clusters may become
empty and be removed and new clusters may be added in.

5.5.1 Variational Bayesian Approximation of $\hat{V}_{G_0}(\mathcal{D}^{(K)})$ and $p(\Theta \mid \mathcal{D}^{(K)}, G_0)$

In this subsection we drop the variable dependence on environment $m$, for notational
simplicity. The discussion assumes a set of episodes $\mathcal{D}^{(K)} = \{(a_0^k r_0^k o_1^k a_1^k r_1^k \cdots o_{T_k}^k a_{T_k}^k r_{T_k}^k)\}_{k=1}^{K}$,
which may come from a single environment or a conglomeration of several environments.

We now derive the variational Bayesian approximation of the marginal empirical value
function $\hat{V}_{G_0}(\mathcal{D}^{(K)})$ as defined in (50). We begin by rewriting (50), using (0) and (53), as

$$\hat{V}_{G_0}(\mathcal{D}^{(K)}) = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k} \tilde{r}_t^k \sum_{z_0^k,\dots,z_t^k=1}^{|Z|} \int p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \Theta)\, G_0(\Theta)\, d\Theta \qquad (73)$$

We follow the general variational Bayesian approach (Jordan et al., 1999; Jaakkola, 2001;
Beal, 2003)⁵ to find a variational lower bound to $\ln \hat{V}_{G_0}(\mathcal{D}^{(K)})$, and the variational Bayesian
approximation of $\hat{V}_{G_0}(\mathcal{D}^{(K)})$ is obtained as the exponential of the lower bound. The lower
bound is a functional of a set of factorized forms $\{q_t^k(z_{0:t}^k)\, g(\Theta) : z_t^k \in Z,\ t = 1 \dots T_k,\ k = 1 \dots K\}$
that satisfies the following normalization constraints:

$$\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\dots,z_t^k=1}^{|Z|} q_t^k(z_{0:t}^k) = K \quad \text{and} \quad q_t^k(z_{0:t}^k) \ge 0 \quad \forall\, z_{0:t}^k, t, k$$

$$\int g(\Theta)\, d\Theta = 1 \quad \text{and} \quad g(\Theta) \ge 0 \quad \forall\, \Theta$$

The lower bound is maximized with respect to $\{q_t^k(z_{0:t}^k)\, g(\Theta)\}$. As will become clear below,
maximization of the lower bound is equivalent to minimization of the Kullback-Leibler
(KL) distance between the factorized forms and the weighted true joint posterior of the $z$'s and $\Theta$.
In this sense, the optimal $g(\Theta)$ is a variational Bayesian approximation to the posterior
$p(\Theta \mid \mathcal{D}^{(K)}, G_0)$. It should be noted that, as before, the weights result from the empirical
value function and are not a part of standard VB (as applied to likelihood functions).

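The variational lower bound is obtained by applying Jensen's inequality to $\ln \hat{V}_{G_0}(\mathcal{D}^{(K)})$,

$$\ln \hat{V}_{G_0}(\mathcal{D}^{(K)}) = \ln \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\dots,z_t^k=1}^{|Z|} \int q_t^k(z_{0:t}^k)\, g(\Theta)\, \frac{\tilde{r}_t^k\, G_0(\Theta)\, p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \Theta)}{q_t^k(z_{0:t}^k)\, g(\Theta)}\, d\Theta$$

$$\ge \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\dots,z_t^k=1}^{|Z|} \int q_t^k(z_{0:t}^k)\, g(\Theta)\, \ln \frac{\tilde{r}_t^k\, G_0(\Theta)\, p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \Theta)}{q_t^k(z_{0:t}^k)\, g(\Theta)}\, d\Theta$$

$$= \ln \hat{V}_{G_0}(\mathcal{D}^{(K)}) - \mathrm{KL}\left(\big\{q_t^k(z_{0:t}^k)\, g(\Theta)\big\} \,\Big\|\, \Big\{\frac{\nu_t^k}{\hat{V}_{G_0}(\mathcal{D}^{(K)})}\, p(z_{0:t}^k, \Theta \mid a_{0:t}^k, o_{1:t}^k)\Big\}\right)$$

$$\stackrel{\mathrm{def}}{=} \mathrm{LB}\big(\{q_t^k\}, g(\Theta)\big) \qquad (74)$$

where $\nu_t^k$ is the average recomputed reward as given in (S3), and

$$\mathrm{KL}\left(\big\{q_t^k(z_{0:t}^k)\, g(\Theta)\big\} \,\Big\|\, \Big\{\frac{\nu_t^k}{\hat{V}_{G_0}(\mathcal{D}^{(K)})}\, p(z_{0:t}^k, \Theta \mid a_{0:t}^k, o_{1:t}^k)\Big\}\right) = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\dots,z_t^k=1}^{|Z|} \int q_t^k(z_{0:t}^k)\, g(\Theta)\, \ln \frac{q_t^k(z_{0:t}^k)\, g(\Theta)}{\frac{\nu_t^k}{\hat{V}_{G_0}(\mathcal{D}^{(K)})}\, p(z_{0:t}^k, \Theta \mid a_{0:t}^k, o_{1:t}^k)}\, d\Theta \qquad (75)$$

with $\mathrm{KL}(q \| p)$ denoting the Kullback-Leibler distance.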
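
The structure of the bound (log of a sum on the left, a weighted sum of logs on the right, with the gap equal to a KL distance) can be checked numerically on a toy discrete example. The sketch below is not the RPR bound itself: it replaces the sums and the integral by a single sum over three positive terms, and all names and numbers in it are assumptions chosen for illustration.

```python
import math

def lower_bound(weights, q):
    """Jensen lower bound sum_i q_i * ln(w_i / q_i) on ln(sum_i w_i)."""
    return sum(qi * math.log(wi / qi) for wi, qi in zip(weights, q) if qi > 0)

def kl(q, p):
    """Kullback-Leibler distance between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

w = [0.2, 0.5, 0.1]                      # positive terms whose sum plays the role of the value
log_total = math.log(sum(w))
uniform = [1.0 / 3, 1.0 / 3, 1.0 / 3]    # an arbitrary admissible q
optimal = [wi / sum(w) for wi in w]      # the normalized terms, i.e. the "posterior"

gap_uniform = log_total - lower_bound(w, uniform)
gap_optimal = log_total - lower_bound(w, optimal)
```

Here `gap_uniform` coincides with `kl(uniform, optimal)` and `gap_optimal` vanishes, mirroring the statement that maximizing the lower bound is the same as minimizing the KL distance to the (weighted) posterior.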

For any set $\{q_t^k(z_{0:t}^k)\, g(\Theta) : z_t^k \in Z,\ t = 1 \dots T_k,\ k = 1 \dots K\}$ satisfying the above
normalization constraints, the inequality in (74) holds. In order to obtain the lower bound that is
closest to $\ln \hat{V}_{G_0}(\mathcal{D}^{(K)})$, one maximizes the lower bound by optimizing $(\{q_t^k\}, g(\Theta))$ subject to
the normalization constraints. Since $\ln \hat{V}_{G_0}(\mathcal{D}^{(K)})$ is independent of $\Theta$ and $\{q_t^k\}$, it is clear
that maximization of the lower bound $\mathrm{LB}(\{q_t^k\}, g(\Theta))$ is equivalent to minimization of the
KL distance between $\{q_t^k(z_{0:t}^k)\, g(\Theta)\}$ and the weighted posterior $\big\{\frac{\nu_t^k}{\hat{V}_{G_0}(\mathcal{D}^{(K)})}\, p(z_{0:t}^k, \Theta \mid a_{0:t}^k, o_{1:t}^k)\big\}$,
where the weight for episode $k$ at time step $t$ is $\frac{\nu_t^k}{\hat{V}_{G_0}(\mathcal{D}^{(K)})} = K\, \frac{\nu_t^k}{\sum_{k=1}^{K}\sum_{t=0}^{T_k} \nu_t^k}$ (the equation
results directly from (62)), i.e., $K$ times the fraction that the average recomputed
reward $\nu_t^k$ occupies in the total average recomputed reward. Therefore the factorized form
$\{q_t^k(z_{0:t}^k)\, g(\Theta)\}$ represents an approximation of the weighted posterior when the lower bound
reaches the maximum, and the corresponding $g(\Theta)$ is called the approximate variational
posterior of $\Theta$.

5. The standard VB applies to a likelihood function. Since we are using a value function instead of a
likelihood function, the VB derivation here is not a standard one, just as the Bayes rule in (S3) is
non-standard.

The lower bound maximization is accomplished by solving for $\{q_t^k(z_{0:t}^k)\}$ and $g(\Theta)$
alternately, keeping one fixed while solving for the other, as shown in Theorem 8.


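Theorem 8. Iteratively applying the following two equations produces a sequence of
monotonically increasing lower bounds $\mathrm{LB}(\{q_t^k\}, g(\Theta))$, which converges to a maximum,

$$q_t^k(z_{0:t}^k) = C_z \exp\Big\{ \int g(\Theta)\, \ln\big[\tilde{r}_t^k\, p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \Theta)\big]\, d\Theta \Big\} \qquad (76)$$

$$g(\Theta) = C_\Theta \exp\Big\{ \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\dots,z_t^k=1}^{|Z|} q_t^k(z_{0:t}^k)\, \ln\big[\tilde{r}_t^k\, G_0(\Theta)\, p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \Theta)\big] \Big\} \qquad (77)$$

where $C_z$ and $C_\Theta$ are normalization constants such that $\int g(\Theta)\, d\Theta = 1$ and
$\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\dots,z_t^k=1}^{|Z|} q_t^k(z_{0:t}^k) = K$.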

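It is seen from (77) that the variational posterior $g(\Theta)$ takes the form of a product,
where each term in the product is uniquely associated with a sub-episode. As will be clear
shortly, the terms are properly collected and the associated sub-episodes simultaneously
employed in the posterior. We now discuss the computations involved in Theorem 8.

Calculation of $\{q_t^k(z_{0:t}^k)\}$  We use the prior of $\Theta$ as specified by (52). It is not difficult
to verify from (77) that the variational posterior $g(\Theta)$ takes the same form as the prior, i.e.,

$$g(\Theta) = p(\mu \mid \hat\upsilon)\, p(\pi \mid \hat\rho)\, p(W \mid \hat\omega) \qquad (78)$$

where the three factors respectively have the forms of (53), (54), and (55); we have put a
hat ^ above the hyper-parameters of $g(\Theta)$ to indicate the difference from those of the prior.
Substituting (D) and (78) into (76), we obtain

$$q_t^k(z_{0:t}^k) = C_z\, \tilde{r}_t^k \exp\Big\{ \sum_{\tau=0}^{t} \big\langle \ln \pi(z_\tau^k, a_\tau^k) \big\rangle_{p(\pi \mid \hat\rho)} + \big\langle \ln \mu(z_0^k) \big\rangle_{p(\mu \mid \hat\upsilon)} + \sum_{\tau=1}^{t} \big\langle \ln W(z_{\tau-1}^k, a_{\tau-1}^k, o_\tau^k, z_\tau^k) \big\rangle_{p(W \mid \hat\omega)} \Big\}$$

$$= C_z\, \tilde{r}_t^k\, \tilde\mu(z_0^k)\, \tilde\pi(z_0^k, a_0^k) \prod_{\tau=1}^{t} \tilde{W}(z_{\tau-1}^k, a_{\tau-1}^k, o_\tau^k, z_\tau^k)\, \tilde\pi(z_\tau^k, a_\tau^k) \qquad (79)$$

where $\langle \cdot \rangle_{p(\pi \mid \hat\rho)}$ denotes taking expectation with respect to $p(\pi \mid \hat\rho)$, and

$$\tilde\mu(j) = \exp\big\{ \langle \ln \mu(j) \rangle_{p(\mu \mid \hat\upsilon)} \big\} = \exp\Big\{ \psi(\hat\upsilon_j) - \psi\Big(\sum_{j'=1}^{|Z|} \hat\upsilon_{j'}\Big) \Big\} \qquad (80)$$

$$\tilde\pi(i, a) = \exp\big\{ \langle \ln \pi(i, a) \rangle_{p(\pi \mid \hat\rho)} \big\} = \exp\Big\{ \psi(\hat\rho_{i,a}) - \psi\Big(\sum_{a'=1}^{|A|} \hat\rho_{i,a'}\Big) \Big\} \qquad (81)$$

$$\tilde{W}(i, a, o, j) = \exp\big\{ \langle \ln W(i, a, o, j) \rangle_{p(W \mid \hat\omega)} \big\} = \exp\Big\{ \psi(\hat\omega_{i,a,o,j}) - \psi\Big(\sum_{j'=1}^{|Z|} \hat\omega_{i,a,o,j'}\Big) \Big\} \qquad (82)$$

each of which is a finite set of nonnegative numbers with a sum less than one. Such a
finite set is called under-normalized probabilities in (Beal, 2003) and used there to perform
variational Bayesian learning of hidden Markov models (HMM). The $\psi(\cdot)$ is the digamma
function.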
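
The under-normalized probabilities are straightforward to compute once a digamma routine is available. The sketch below uses the standard identity that, under a Dirichlet distribution with counts $\alpha$, the expectation of $\ln x_j$ equals $\psi(\alpha_j) - \psi(\sum_{j'} \alpha_{j'})$. To stay free of external dependencies it implements the digamma function with the usual recurrence and asymptotic series; this helper and the function names are assumptions of the example (a library routine could be used instead).

```python
import math

def digamma(x):
    """Digamma for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:              # shift the argument up until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def under_normalized(counts):
    """exp(<ln x_j>) for x ~ Dirichlet(counts); the entries sum to less than one."""
    total = digamma(sum(counts))
    return [math.exp(digamma(c) - total) for c in counts]
```

Applying `under_normalized` to $\hat\upsilon$, to each row $\hat\rho_i$, and to each vector $\hat\omega_{i,a,o}$ gives the three sets of under-normalized probabilities.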

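It is interesting to note that the product $\tilde\mu(z_0^k)\, \tilde\pi(z_0^k, a_0^k) \prod_{\tau=1}^{t} \tilde{W}(z_{\tau-1}^k, a_{\tau-1}^k, o_\tau^k, z_\tau^k)\, \tilde\pi(z_\tau^k, a_\tau^k)$
in (79) has exactly the same form as the expression of $p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \Theta)$ in
(0), except that $\Theta$ is replaced by $\tilde\Theta = \{\tilde\mu, \tilde\pi, \tilde{W}\}$. Therefore, one can nominally rewrite
(79) as

$$q_t^k(z_{0:t}^k) = C_z\, \tilde{r}_t^k\, p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \tilde\Theta) \qquad (83)$$

with the normalization constant given by

$$C_z^{-1} = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k} \tilde{r}_t^k \sum_{z_0^k,\dots,z_t^k=1}^{|Z|} p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \tilde\Theta) \qquad (84)$$

such that the constraint $\sum_{k=1}^{K}\sum_{t=0}^{T_k}\sum_{z_0^k,\dots,z_t^k=1}^{|Z|} q_t^k(z_{0:t}^k) = K$ is satisfied. One may also find
that the normalization constant $C_z^{-1}$ is a nominal empirical value function that has the same
form as the empirical value function in (ED). The only difference is that the normalized $\Theta$
is replaced by the under-normalized $\tilde\Theta$. Therefore, one may write

$$C_z^{-1} = \hat{V}(\mathcal{D}^{(K)}; \tilde\Theta) \qquad (85)$$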

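Since $\tilde\Theta = \{\tilde\mu, \tilde\pi, \tilde{W}\}$ are under-normalized, $p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \tilde\Theta)$ is not a proper probability
distribution. However, one may still write $p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \tilde\Theta) = p(a_{0:t}^k \mid o_{1:t}^k, \tilde\Theta)\, p(z_{0:t}^k \mid a_{0:t}^k, o_{1:t}^k, \tilde\Theta)$,
where $p(a_{0:t}^k \mid o_{1:t}^k, \tilde\Theta) = \sum_{z_0^k,\dots,z_t^k=1}^{|Z|} p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \tilde\Theta)$ and
$p(z_{0:t}^k \mid a_{0:t}^k, o_{1:t}^k, \tilde\Theta) = \frac{p(a_{0:t}^k, z_{0:t}^k \mid o_{1:t}^k, \tilde\Theta)}{p(a_{0:t}^k \mid o_{1:t}^k, \tilde\Theta)}$.
Note that $p(z_{0:t}^k \mid a_{0:t}^k, o_{1:t}^k, \tilde\Theta)$ is a proper probability distribution. Accordingly, $q_t^k(z_{0:t}^k)$ can
be rewritten as

$$q_t^k(z_{0:t}^k) = \frac{\sigma_t^k(\tilde\Theta)}{\hat{V}(\mathcal{D}^{(K)}; \tilde\Theta)}\, p(z_{0:t}^k \mid a_{0:t}^k, o_{1:t}^k, \tilde\Theta) \qquad (86)$$

where

$$\sigma_t^k(\tilde\Theta) = \tilde{r}_t^k\, p(a_{0:t}^k \mid o_{1:t}^k, \tilde\Theta) = \tilde{r}_t^k \prod_{\tau=0}^{t} p(a_\tau^k \mid h_\tau^k, \tilde\Theta) \qquad (87)$$

is called variational re-computed reward, which has the same form as the re-computed
reward given in (113) but with $\Theta$ replaced by $\tilde\Theta$. The second equality in (87) is based on
the equation $p(a_{0:t} \mid o_{1:t}, \Theta) = \prod_{\tau=0}^{t} p(a_\tau \mid h_\tau, \Theta)$ established in (E3) and (O). The nominal
empirical value function $\hat{V}(\mathcal{D}^{(K)}; \tilde\Theta)$ can now be expressed in terms of $\sigma_t^k(\tilde\Theta)$,

$$\hat{V}(\mathcal{D}^{(K)}; \tilde\Theta) = \frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{T_k} \sigma_t^k(\tilde\Theta) \qquad (88)$$

Equation (86) shows that $q_t^k(z_{0:t}^k)$ is a weighted posterior of $z_{0:t}^k$.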


(B3), can be equivalently expressed as

The weighted posterior has the same form as (113) used in single-task RPR learning.
Therefore we can borrow the techniques developed there to compute the marginal distributions of
$p(z_{0:t}^k \mid a_{0:t}^k, o_{1:t}^k, \tilde\Theta)$, particularly those defined in (E3) and (GO). For clarity, we rewrite these
marginal distributions below without re-deriving them, with $\Theta$ replaced by $\tilde\Theta$,

$$\xi_{t,\tau}^k(i, j) = p(z_\tau^k = i,\ z_{\tau+1}^k = j \mid a_{0:t}^k, o_{1:t}^k, \tilde\Theta) \qquad (90)$$

$$\phi_{t,\tau}^k(i) = p(z_\tau^k = i \mid a_{0:t}^k, o_{1:t}^k, \tilde\Theta) \qquad (91)$$

These marginals along with the $\{\eta_t^k(\tilde\Theta)\}$ defined in (E3) will be used below to compute the
variational posterior $g(\Theta)$.
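
For intuition, the single-state marginal in (91) can be obtained with a forward-backward pass over one sub-episode, exactly as in a hidden Markov model. The sketch below is a generic, unscaled forward-backward recursion written for this example; it is not claimed to match the normalization of the $\alpha$ and $\beta$ variables used elsewhere in the paper, and the function names and the nested-list layout `W[i][a][o][j]` are assumptions.

```python
def joint(mu, pi, W, a, o, z):
    """mu(z_0) pi(z_0,a_0) prod_{tau>=1} W(z_{tau-1},a_{tau-1},o_tau,z_tau) pi(z_tau,a_tau)."""
    p = mu[z[0]] * pi[z[0]][a[0]]
    for tau in range(1, len(z)):
        p *= W[z[tau - 1]][a[tau - 1]][o[tau - 1]][z[tau]] * pi[z[tau]][a[tau]]
    return p

def state_marginals(mu, pi, W, a, o):
    """Return (sum over z of the joint, phi) with phi[tau][i] = p(z_tau = i | a, o).

    a has length t+1 (a_0..a_t); o has length t, with o[tau-1] the observation at step tau.
    The parameters may be under-normalized; phi is still a proper distribution.
    """
    nZ, t = len(mu), len(a) - 1
    # Forward pass: fwd[tau][j] sums the joint over z_0..z_{tau-1} with z_tau = j.
    fwd = [[mu[i] * pi[i][a[0]] for i in range(nZ)]]
    for tau in range(1, t + 1):
        prev = fwd[-1]
        fwd.append([sum(prev[i] * W[i][a[tau - 1]][o[tau - 1]][j] for i in range(nZ))
                    * pi[j][a[tau]] for j in range(nZ)])
    # Backward pass: bwd[tau][i] sums the remaining factors over z_{tau+1}..z_t.
    bwd = [[1.0] * nZ for _ in range(t + 1)]
    for tau in range(t - 1, -1, -1):
        bwd[tau] = [sum(W[i][a[tau]][o[tau]][j] * pi[j][a[tau + 1]] * bwd[tau + 1][j]
                        for j in range(nZ)) for i in range(nZ)]
    evidence = sum(fwd[t])
    phi = [[fwd[tau][i] * bwd[tau][i] / evidence for i in range(nZ)]
           for tau in range(t + 1)]
    return evidence, phi
```

The cost of one pass is quadratic in $|Z|$ and linear in the sub-episode length, in contrast to the exponential cost of enumerating all $z$ sequences.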

Calculation of the Variational Posterior $g(\Theta)$  To compute $g(\Theta)$, one substitutes (D)
and (S3) into (EZI) and performs summation over the latent $z$ variables. Most $z$ variables are
summed out, leaving only the marginals in (90) and (91). Employing these marginals and
taking into account the weights $\{K \eta_t^k(\tilde\Theta)\}$, the variational posterior (with $\eta_t^k(\tilde\Theta)$ abbreviated
as $\eta_t^k$ for notational simplicity) is obtained as

where $p(\mu \mid \hat\upsilon)$, $p(\pi \mid \hat\rho)$, $p(W \mid \hat\omega)$ have the same forms as in (53), (54), and (55), respectively,
but with the hyper-parameters replaced by

for $i, j = 1, \dots, |Z|$, $a = 1, \dots, |A|$, $o = 1, \dots, |O|$. Note that, for simplicity, we have used
$\{\hat\upsilon, \hat\rho, \hat\omega\}$ to denote the hyper-parameters of $g(\Theta)$ both before and after the updates in
(93)-(95) are made. It should be kept in mind that the $\eta$'s, $\phi$'s, and $\xi$'s are all based on the
numerical values of $\{\hat\upsilon, \hat\rho, \hat\omega\}$ before the updates in (93)-(95) are made, i.e., they are based
on the $\{\hat\upsilon, \hat\rho, \hat\omega\}$ updated in the previous iteration.

It is clear from (92)-(95) that the update of the variational posterior is based on using all
episodes at all time steps (i.e., all sub-episodes). The $\eta_t^k$ can be thought of as a variational
soft count at time $t$ of episode $k$, which is appended to the hyper-parameters (initial Dirichlet
counts) of the prior. Each decision state $z$ receives $\eta_t^k$ in the amount that is proportional
to the probability specified by the posterior marginals $\{\phi_{t,\tau}^k\}$ and $\{\xi_{t,\tau}^k\}$.

Computation of the Lower Bound  To compute the lower bound $\mathrm{LB}(\{q_t^k\}, g(\Theta))$ given
in (74), one first takes the logarithm of (EEI) to obtain

which is then substituted into the right side of (A-11) in the Appendix to cancel the first
term, yielding

$$= \ln \hat{V}(\mathcal{D}^{(K)}; \tilde\Theta) - \mathrm{KL}\big(g(\Theta) \,\|\, G_0(\Theta)\big) \qquad (97)$$

where the last equality follows from (ESI).

The lower bound yields a variational approximation to the logarithm of the marginal
empirical value. As variational Bayesian learning proceeds, the lower bound monotonically
increases, as guaranteed by Theorem 8, and eventually reaches a maximum, at which point
one obtains the best (assuming the maximum is global) variational approximation. By taking
the exponential of the best lower bound, one gets the approximated marginal empirical value.
The lower bound also provides a quantitative measure for monitoring the convergence of
variational Bayesian learning.

5.5.2 The Complete Gibbs-Variational Learning Algorithm

Algorithmic Description  A detailed algorithmic description of the complete
Gibbs-variational algorithm is given in Table 2. The algorithm calls the variational Bayesian (VB)
algorithm in Table 3 as a sub-routine, to find the variational Bayesian approximations to
intractable computations. In particular, the marginal empirical value $\hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ in (68) is
approximated by the exponential of the variational lower bound returned from the VB
algorithm by feeding the episodes $\mathcal{D}_m^{(K_m)}$. The conditional posterior $p(\bar\Theta_n \mid \bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}, G_0)$
in (71) is approximated by the variational posterior $g(\bar\Theta_n)$ returned from the VB algorithm
by feeding the episodes $\bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}$. The variational approximation of $\hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$ need
be computed only once for each environment $m$, before the main loop begins, since it solely
depends on the DP base $G_0$ and the episodes, which are assumed given and fixed. The
variational posterior $g(\bar\Theta_n)$ and $\hat{V}_{G_0}\big(\bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}\big)$, however, need be updated inside
the main loop, because the clusters $\{I_n(c)\}$ keep changing from iteration to iteration.

Upon convergence of the algorithm, one obtains variational approximations to the
posteriors of distinct RPRs $\{g(\bar\Theta_n)\}_{n=1}^{N}$, which along with the cluster indicators $\{c_1, c_2, \dots, c_M\}$
give the variational posterior $g(\Theta_m)$ for each individual environment $m$. By simple
post-processing of the posterior, we obtain the mean or mode of each $\Theta_m$, which gives a single
RPR for each environment and yields the history-dependent policy as given by (0).
Alternatively, one may draw samples from the variational posterior and use them to produce
an ensemble of RPRs for each environment. The RPR ensemble gives multiple
history-dependent policies, which are marginalized (averaged) to yield the final choice for the action.

It should be noted that the VB algorithm in Table 3 can also be used as a stand-alone
algorithm to find the variational posterior of the RPR of each environment independently
of the RPRs of other environments. In this respect the VB is a Bayesian counterpart of the
maximum value (MV) algorithm for single-task reinforcement learning (STRL), presented
in Section 0 and Table 1.

Time Complexity Analysis  The time complexity of the VB algorithm in Table 3 is
given as follows where, as in Section 1 l the complexity is quantified by the number of
real number multiplications in each iteration and is presented in the Big-O notation. For the
reasons stated in Section l L.'LM the complexity per iteration also represents the complexity
of the complete algorithm.

Table 2: The Gibbs-variational algorithm for learning the DP posterior

Input: $\{\mathcal{D}_m^{(K_m)}\}_{m=1}^{M}$, $A$, $O$, $|Z|$, $\{\upsilon, \rho, \omega\}$, $\alpha$

Output: $\{\bar\upsilon_n, \bar\rho_n, \bar\omega_n\}_{n=1}^{N}$ with $N \le M$ and $\{c_1, c_2, \dots, c_M\}$.

1. Computing the variational approximations of $\{\hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})\}$:
   1.1 For $m = 1$ to $M$
         Call the algorithm in Table 3, with the inputs $\mathcal{D}_m^{(K_m)}$, $A$, $O$, $|Z|$,
         $\{\upsilon, \rho, \omega\}$. Record the returned hyper-parameters as $\{\hat\upsilon_m, \hat\rho_m, \hat\omega_m\}$ and the
         approximate $\hat{V}_{G_0}(\mathcal{D}_m^{(K_m)})$.

2. Initializations: Let $j = 1$, $N = M$, $\ell = 0$.
   Let $\bar\upsilon_n = \hat\upsilon_n$, $\bar\rho_n = \hat\rho_n$, $\bar\omega_n = \hat\omega_n$, for $n = 1, \dots, N$.

3. Repeat
   3.1 For $n = 1$ to $N$
         Update $\bar\Theta_n$ by drawing from, or finding the mode of, the $G_0$ with
         hyper-parameters $\{\bar\upsilon_n, \bar\rho_n, \bar\omega_n\}$.
   3.2 For $m = 1$ to $M$
         Let $c_m^{old} = c_m$ and draw $c_m$ according to (68).
         If $c_m \neq c_m^{old}$
            If $c_m = 0$, start a new cluster $I_{N+1}(c) = \{m\}$.
            Elseif $c_m \neq 0$, move the element $m$ from $I_{c_m^{old}}(c)$ to $I_{c_m}(c)$.
            For $n \in \{c_m^{old}, c_m\}$
               If $I_n(c)$ is an empty set
                  Delete the $n$-th cluster.
               Elseif $I_n(c)$ contains a single element (let it be denoted by $m'$)
                  Let $\bar\upsilon_n = \hat\upsilon_{m'}$, $\bar\rho_n = \hat\rho_{m'}$, $\bar\omega_n = \hat\omega_{m'}$. Add $\hat{V}_{G_0}(\mathcal{D}_{m'}^{(K_{m'})})$
                  to $\ell(j)$.
               Else
                  Call the algorithm in Table 3, with the inputs $\bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}$,
                  $A$, $O$, $|Z|$, $\{\upsilon, \rho, \omega\}$. Record the returned hyper-parameters as
                  $\{\bar\upsilon_n, \bar\rho_n, \bar\omega_n\}$. Scale the returned $\hat{V}_{G_0}\big(\bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}\big)$
                  and add the result to $\ell(j)$.
               If $I_n(c)$ is not empty
                  Draw $\bar\Theta_n$ from $G_0$ with hyper-parameters $\{\bar\upsilon_n, \bar\rho_n, \bar\omega_n\}$.
   3.3 Updating $N$:
         Let $N$ be the number of nonempty clusters and renumber the
         nonempty clusters so that their indices are in $\{1, 2, \dots, N\}$.
   3.4 Convergence check:
         If the sequence of $\ell$ converges
            stop the algorithm and exit.
         Otherwise
            Set $j := j + 1$ and $\ell(j) = 0$.

• The computation of $\tilde\Theta$ based on equations (80), (81), and (82) runs in time $O(|Z|)$,
$O(|A||Z|)$, and $O(|A||O||Z|^2)$, respectively.

• Computation of the $\alpha$ variables with (E3) and (ED) (with $\Theta$ replaced by $\tilde\Theta$) runs in
time $O(|Z|^2 \sum_{k=1}^{K} T_k)$.

Table 3: The variational Bayesian learning algorithm for RPR

Input: $\mathcal{D}^{(K)}$, $A$, $O$, $|Z|$, $\{\upsilon, \rho, \omega\}$.

Output: $\{\hat\upsilon, \hat\rho, \hat\omega\}$, $\ln \hat{V}_{G_0}(\mathcal{D}^{(K)}) \approx \mathrm{LB}(\{q_t^k\}, g(\Theta))$.

1. Initialize $\hat\upsilon = \upsilon$, $\hat\rho = \rho$, $\hat\omega = \omega$, $\ell = [\,]$, iteration $= 1$.

2. Repeat
   2.1 Computing $\tilde\Theta$:
         Compute the set of under-normalized probabilities $\tilde\Theta$ using
         equations (80) (81) (82).
   2.2 Dynamical programming:
         Compute the $\alpha$ and $\beta$ variables with (ESI) (ED) (E2), with $\Theta$ replaced
         by $\tilde\Theta$ in these equations.
   2.3 Reward re-computation:
         Calculate the variational recomputed reward $\{\sigma_t^k(\tilde\Theta)\}$ using
         (E3)(E3) and compute the weight $\{\eta_t^k(\tilde\Theta)\}$ using (ESI).
   2.4 Lower bound computation:
         Calculate the variational lower bound $\mathrm{LB}(\{q_t^k\}, g(\Theta))$ using
         (02) (ESI).
   2.5 Convergence check:
         Let $\ell$(iteration) $= \mathrm{LB}(\{q_t^k\}, g(\Theta))$.
         If the sequence of $\ell$ converges
            Stop the algorithm and exit.
         Else
            Set iteration := iteration + 1.
   2.6 Posterior update for $z$:
         Compute the $\xi$ and $\phi$ variables using equations (90) (91).
   2.7 Update of hyper-parameters:
         Compute the updated $\{\hat\upsilon, \hat\rho, \hat\omega\}$ using (93) (94) (95).


•  Computation of the $\beta$ variables with (ED) and (E3) (with $\Theta$ replaced by $\tilde\Theta$) runs in
time $O(|\mathcal{Z}|^2 \sum_{k=1}^{K} \sum_{t=0,\, r_t^k \neq 0}^{T_k} (t+1))$, which is $O(|\mathcal{Z}|^2 \sum_{k=1}^{K} T_k^2)$ in the worst case and
is $O(|\mathcal{Z}|^2 \sum_{k=1}^{K} T_k)$ in the best case, where the worst and best cases are distinguished
by the sparseness of immediate rewards, as discussed in Section EL2I.

•  The reward re-computation using (E3), (E3), and (ED) requires time $O(\sum_{k=1}^{K} T_k)$ in
the worst case and $O(K)$ in the best case.

•  Computation of the lower bound using (ESI) and (03) requires time $O(|\mathcal{A}||\mathcal{O}||\mathcal{Z}|^2)$.

•  Update of the hyper-parameters using (031), (EH), and (03), as well as computation of
the $\xi$ and $\phi$ variables using (E3) and (EH), runs in time $O(|\mathcal{Z}|^2 \sum_{k=1}^{K} T_k^2)$ in the worst
case and $O(|\mathcal{Z}|^2 \sum_{k=1}^{K} T_k)$ in the best case.

The overall complexity of the VB algorithm is seen to be $O(|\mathcal{Z}|^2 \sum_{k=1}^{K} T_k^2)$ in the worst case
and $O(|\mathcal{Z}|^2 \sum_{k=1}^{K} T_k)$ in the best case, based on the fact that $\sum_{k=1}^{K} T_k \gg |\mathcal{A}||\mathcal{O}|$ in general.
Thus the VB algorithm has the same time complexity as the value maximization algorithm
in Table EL. Note that the time dependency on the lengths of episodes is dictated by the
sparseness of the immediate rewards; for most problems considered in Section 6, the agent
receives rewards only when the goal state is reached, in which case the VB algorithm scales
linearly with the lengths of episodes.
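
The gap between the best and worst cases can be illustrated with a small operation count. The sketch below assumes, as a reading of the bound for the $\beta$ variables, that each time step t carrying a nonzero immediate reward contributes work proportional to $(t + 1)|\mathcal{Z}|^2$; the function name `beta_cost` and the episode length of 50 are hypothetical choices for illustration only.

```python
def beta_cost(reward_steps, num_states):
    """Order-of-magnitude cost |Z|^2 * sum of (t + 1) over rewarded steps.

    reward_steps: for each episode, the list of 0-based time indices t
    at which the immediate reward is nonzero.
    """
    return num_states ** 2 * sum(t + 1 for steps in reward_steps for t in steps)


T = 50                           # hypothetical episode length
sparse = [[T - 1]]               # reward only at the final step (goal reached)
dense = [list(range(T))]         # nonzero reward at every step

sparse_cost = beta_cost(sparse, num_states=20)   # grows linearly in T
dense_cost = beta_cost(dense, num_states=20)     # grows quadratically in T
```

With a single rewarded step per episode the count is proportional to T, whereas with a reward at every step it is proportional to T(T + 1)/2, which matches the linear versus quadratic distinction above.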

The complexity of the Gibbs-variational algorithm can be easily obtained based on the
complexity analysis above for the VB algorithm. At the beginning and before entering the
main loop, the Gibbs-variational algorithm calls the VB to compute the variational
approximation of the marginal empirical value $\{\widehat{V}_{G_0}(\mathcal{D}_m^{(K_m)})\}$ for each environment m, by feeding
the associated episodes $\mathcal{D}_m^{(K_m)}$. These computations are performed only once. For each
environment the VB runs until convergence, with a time complexity between $O(|\mathcal{Z}|^2 \sum_{k=1}^{K_m} T_{m,k})$
and $O(|\mathcal{Z}|^2 \sum_{k=1}^{K_m} T_{m,k}^2)$ per iteration, depending on the sparseness of the immediate
rewards. Inside the main loop, the Gibbs-variational algorithm calls the VB to compute
the variational posterior of the distinct RPR for each cluster n, by feeding the merged episodes
$\bigcup_{m \in I_n(c)} \mathcal{D}_m^{(K_m)}$. These computations are performed each time the clusters are updated, with
a time complexity between $O(|\mathcal{Z}|^2 \sum_{m \in I_n(c)} \sum_{k=1}^{K_m} T_{m,k})$ and $O(|\mathcal{Z}|^2 \sum_{m \in I_n(c)} \sum_{k=1}^{K_m} T_{m,k}^2)$
per iteration for cluster n.

6.  Experimental  Results 

We compare the performance of RPR in multi-task reinforcement learning (MTRL) versus
single-task reinforcement learning (STRL), and demonstrate the advantage of MTRL. The
RPR for MTRL is implemented by the Gibbs-variational algorithm in Table □ and the
RPR for STRL is implemented by the maximum-value (MV) algorithm in Table □. The
variational Bayesian (VB) algorithm in Table 3, which is a Bayesian counterpart of the
MV algorithm, generally performs similarly to the MV for STRL and is thus excluded from the
comparisons.

Since  the  MV  algorithm  is  a  new  technique  developed  in  this  report,  we  evaluate  the 
performance  of  the  MV  before  proceeding  to  the  comparison  of  MTRL  and  STRL.  We  also 
compare  the  MV  to  the  method  of  first  learning  a  POMDP  model  from  the  episodes  and 
then  finding  the  optimal  policy  for  the  POMDP. 

6.1  Performance of RPR in Single-Task Reinforcement Learning (STRL)

We consider the benchmark example Hallway2, introduced in (Littman et al., 1995). The
Hallway2 problem was originally designed to test algorithms based on a given POMDP
model, and it has recently been employed as a benchmark for testing model-free reinforcement
learning algorithms (Bakker, 2004; Wierstra and Wiering, 2004).

Hallway2 is a navigation problem in which an agent is situated in a maze consisting of a
number of connected rooms and walls, and the agent navigates the maze with the
objective of reaching the goal in a minimum number of steps. The
maze is characterized by 92 states, each representing one of four orientations (south, north,
east, west) in any of 23 rectangular areas, and four of the states (corresponding to a single
rectangular area) represent the goal. The observations consist of $2^4 = 16$ combinations of
presence/absence of a wall as viewed when standing in a rectangle facing one of the four
orientations, and there is an observation uniquely associated with the goal. There are five
different actions that the agent can take: {stay in place, move forward, turn right, turn left,
turn around}. Both state transitions and observations are very noisy (uncertain), except
that the goal is fully identified by the unique observation associated with it. The reward
function  is  defined  in  such  a  way  that  a  large  reward  is  received  when  the  agent  enters 
the  goal  from  the  adjacent  states,  and  zero  reward  is  received  otherwise.  Thus  the  reward 
structure  is  highly  sparse  and  both  the  MTRL  and  STRL  algorithms  scale  linearly  with  the 
lengths  of  episodes  in  this  case,  as  discussed  in  Sections  ll  ' >  ■’l  and  h.ti.'A 

6.1.1  Performance  as  a  Function  of  Number  of  Episodes 

We investigate the performance of the RPR as a function of K, the number of episodes
used in the learning. For each given K, we learn an RPR from a set of K episodes $\mathcal{D}^{(K)}$ that
are generated by following the behavior policy $\Pi$, and the learning follows the procedures
described in Section EH.

The conditions for the policy $\Pi$, as given in Theorem 0, are very mild, and are satisfied
by a uniformly random policy. However, a uniformly random agent may take a long time to
reach the goal, which makes the learning very slow. To accelerate learning, we use a
semi-random policy $\Pi$, which is simulated by the rule that, with probability $p_{\text{query}}$, $\Pi$ chooses
an action suggested by the PBVI algorithm (Pineau et al., 2003) and, with probability
$1 - p_{\text{query}}$, $\Pi$ chooses an action uniformly sampled from $\mathcal{A}$. The use of PBVI here is similar
to the meta-queries used in (Doshi et al., 2008), where a meta-query consults a domain
expert (who is assumed to have access to the true POMDP model) for the optimal action
at a particular time step. The meta-queries correspond to human-robot interactions in
robotics applications. It should be noted that, by implementing the reward re-computation
in RPR online, the behavior policy in each iteration simply becomes the RPR in the previous
iteration, in which case the use of an external policy like PBVI is eliminated.
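
The semi-random rule can be written in a few lines. The sketch below is illustrative only: `semi_random_action` is a hypothetical name, and the expert's suggestion (PBVI in the experiments) is assumed to be supplied by the caller rather than computed here.

```python
import random


def semi_random_action(expert_action, actions, p_query, rng=random):
    """With probability p_query follow the expert's suggestion;
    otherwise draw an action uniformly at random from `actions`."""
    if rng.random() < p_query:
        return expert_action
    return rng.choice(actions)


# The five Hallway2 actions, abbreviated.
ACTIONS = ["stay", "forward", "right", "left", "around"]
action = semi_random_action("forward", ACTIONS, p_query=0.3)
```

Because the uniform branch can also land on the expert's action, the suggested action is executed with overall probability $p_{\text{query}} + (1 - p_{\text{query}})/|\mathcal{A}|$.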

In principle, the number of decision states (belief regions) $|\mathcal{Z}|$ can be selected by
maximizing the marginal empirical value $\widehat{V}_{G_0}(\mathcal{D}^{(K)}) = \int \widehat{V}(\mathcal{D}^{(K)}; \Theta) G_0(\Theta)\, d\Theta$ with respect to $|\mathcal{Z}|$,
where an approximation of $\widehat{V}_{G_0}(\mathcal{D}^{(K)})$ can be found by the VB algorithm in Table 3. Because
the MV does not employ a prior, we make a nominal prior $G_0(\Theta)$ by letting it take the form
of (E3) but with all hyper-parameters uniformly set to one. This leads to $G_0(\Theta) = C_{mv}$,
where $C_{mv}$ is a normalization constant. Therefore maximization of $\widehat{V}_{G_0}(\mathcal{D}^{(K)})$ is equivalent
to maximization of $\int \widehat{V}(\mathcal{D}^{(K)}; \Theta)\, d\Theta$, which serves as evidence of how well the choice
of $|\mathcal{Z}|$ fits the episodes in terms of empirical value. According to the Occam's Razor
principle (Beal, 2003), the minimum $|\mathcal{Z}|$ fitting the episodes has the best generalization. In
practice, letting $|\mathcal{Z}|$ be a multiple of the number of actions is usually a good choice (here
$|\mathcal{Z}| = 4 \times 5 = 20$) and we find that the performance of the RPR is quite robust to the
choice of $|\mathcal{Z}|$. This may be attributed to the fact that learning of the RPR is a process
of allocating counts to the decision states: when more decision states are included, they
simply share the counts that otherwise would have been allocated to a single decision state.
Provided the sharing of counts is consistent among $\mu$, $\pi$, and $W$, the policy will not change
much.

The performance of the RPR is compared against EM-PBVI, the method that first learns
a predictive model as in (Chrisman, 1992) and then learns the policy based on the predictive
model. Here the predictive model is a POMDP learned by expectation maximization (EM)
based on $\mathcal{D}^{(K)}$, and the PBVI (Pineau et al., 2003) is employed to find the policy given the
POMDP. To examine the effect of the behavior policy $\Pi$ on the RPR's performance, we
consider three different $\Pi$'s, which respectively have a probability $p_{\text{query}}$ = 5%, 30%, 50% of
choosing the actions suggested by PBVI, where $p_{\text{query}}$ corresponds to the rate of meta-query
in (Doshi et al., 2008). The episodes used to learn EM-PBVI are collected by the behavior
policy with $p_{\text{query}}$ = 50%, which is the highest query rate considered here. Therefore the
experiments are biased favorably towards the EM-PBVI, in terms of the number of
expert-suggested actions that are employed to generate the training episodes.

The performance of each RPR, as well as that of EM-PBVI, is evaluated on Hallway2 by
following the standard testing procedure as set up in (Littman et al., 1995). For each policy,
a total of 251 independent trials are performed and each trial is terminated when either the
goal is reached or a maximum budget of 251 steps is consumed. Three performance measures
are computed based on the 251 trials: (a) the discounted accumulative reward (i.e., the sum
of exponentially discounted rewards received over the $N_{te} \le 251$ steps) averaged over the
251 trials; (b) the goal rate, i.e., the percentage of the 251 trials in which the agent has
reached the goal; (c) the number of steps that the agent has actually taken, averaged over
the 251 trials.
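
A minimal sketch of how the three measures could be computed from recorded trials is given below. It assumes that the only nonzero reward is the goal reward, received at the step when the goal is entered, and the trial record format, the function name `summarize_trials`, and the discount factor used in the usage line are hypothetical.

```python
def summarize_trials(trials, gamma, max_steps=251):
    """trials: list of (steps_taken, reached_goal, goal_reward) tuples.

    Returns the three test measures averaged over the trials.
    """
    n = len(trials)
    # Discounted return of a trial: gamma**t * r_goal if the goal is
    # entered at step t, and zero if the step budget runs out first.
    returns = [
        (gamma ** steps) * reward if reached else 0.0
        for steps, reached, reward in trials
    ]
    return {
        "discounted_reward": sum(returns) / n,
        "goal_rate": sum(1 for _, reached, _ in trials if reached) / n,
        "mean_steps": sum(min(steps, max_steps) for steps, _, _ in trials) / n,
    }


# Two hypothetical trials: one reaches the goal after 10 steps, one never does.
stats = summarize_trials([(10, True, 1.0), (251, False, 1.0)], gamma=0.5)
```

A trial that exhausts its budget contributes zero to the discounted reward and the full 251 steps to the step count.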

The results on Hallway2 are summarized in Figure 1, where we present each of the
three performance measures plus the learning time, as a function of $\log_{10}$ of the number of
episodes K used in the learning. The four curves in each sub-figure correspond to the EM-PBVI
and the three RPRs with a rate of PBVI query of 5%, 30%, and 50%, respectively. Each curve
results from an average over 20 independently generated $\mathcal{D}^{(K)}$ and the error bars show the
standard deviations. For simplicity, the error bars are shown only for the RPR with a 50%
query rate.

As shown in Figure 1, the performance of the RPR improves as the number of episodes
K used to learn it increases, regardless of the behavior policy $\Pi$. As recalled from Theorem
0, the empirical value function $\widehat{V}(\mathcal{D}^{(K)}; \Theta)$ approaches the exact value function as K goes to
infinity. Assuming the RPR has enough memory (decision states) and the algorithm finds
the global maximum, the RPR will approach the optimal policy as K increases. Therefore,
Figure 1 serves as an experimental verification of Theorem 0. The CPU time shown in
Figure 1(d) is exponential in $\log_{10} K$ or, equivalently, is linear in K. The linear time is
consistent  with  the  complexity  analysis  in  Section  I  I  ..'1-21. 

The error bars of the goal rate exhibit quick shrinkage with K and those of the median
number of steps also shrink relatively fast. In contrast, the discounted accumulative reward
has a very slow shrinkage rate for its error bars. The different shrinkage rates show that
it is much easier to reach the goal within the prescribed number of steps (251 here) than
to reach the goal in relatively fewer steps. Note that, when the goal is reached at the t-th
step, the number of steps is t but the discounted accumulative reward is $\gamma^{t} r_{\text{goal}}$, where
$r_{\text{goal}}$ is the reward of entering the goal state. The exponential discounting explains the
difference between the number of steps and the discounted accumulative reward regarding
the shrinkage rates of error bars.

A comparison of the three RPR curves in Figure 1 shows that the rate at which the
behavior policy $\Pi$ uses or queries PBVI influences the RPR's performance and that the influence
depends on K. When K is small, increasing the query rate significantly improves the
performance; whereas, when K gets larger, the influence decreases until it eventually vanishes.
The decreased influence is most easily seen between the curves of 30% and 50% query rates.
To keep the performance from degrading when the query rate decreases to as low as 5%, a
much larger K may be required. These experimental results confirm that random actions
can accomplish a good exploration of available rewards (the goal state here) by collecting
a large number of (lengthy) episodes, and the RPR learned from these episodes performs
competitively. With a small number of episodes, however, random actions achieve limited
exploration and the resulting RPR performs poorly. In the latter case, queries to experts
like PBVI play an important role in improving the exploration and the RPR's performance.

Figure 1: Performance comparison on the Hallway2 problem. The horizontal axis is $\log_{10}$ of the
number of episodes K used in learning the RPR. The vertical axis in each sub-figure
is (a) Goal rate (b) Discounted accumulative reward (c) Number of steps to reach the
goal (d) Time in seconds for learning the RPR. The four curves in each sub-figure correspond
to the EM-PBVI and the RPR based on a behavior policy $\Pi$ that queries PBVI with a
probability $p_{\text{query}}$ = 5%, 30%, 50%, respectively. The EM-PBVI employs EM to learn a
POMDP model based on the episodes collected by $\Pi$ with $p_{\text{query}}$ = 50% and then uses
the PBVI (Pineau et al., 2003) to find the optimal policy based on the learned POMDP.
Each curve results from an average over 20 independent runs and, for simplicity, the error
bars are shown only for the RPR with a 50% query rate. The performance measures in
(a)-(c) are explained in greater detail in the text.

It is also seen from Figure 1 that the performance of EM-PBVI is not satisfactory and
grows slowly with K. The poor performance is strongly related to insufficient exploration
of the environment by the limited episodes. For EM-PBVI, the required number of episodes
is more demanding because the initial objective is to build a POMDP model instead of
learning a policy: policy learning is concerned with exploring the reward
structure, but building a POMDP requires exploration of all aspects of the environment.
This demonstrates the drawback of methods that rely on learning an intervening POMDP
model, with which a policy is designed subsequently.

6.2  Performance  of  RPR  in  Multi-task  Reinforcement  Learning  (MTRL) 

6.2.1  Maze  Navigation 

Problem Description  In this problem, there are M = 10 environments and each
environment is a grid-world, i.e., an array of rectangular cells. Of the ten environments, three
are distinct and are shown in Figure 2; the remaining ones are duplicated from the three distinct
ones. Specifically, the first three environments are duplicated from the first distinct one, the
following three environments are duplicated from the second distinct one, and the last four
environments are duplicated from the third distinct one. We assume 10 sets of episodes,
with the m-th set collected from the m-th environment.

In each of the distinct environments shown in Figure 2, the agent can take five actions
{move forward, move backward, move left, move right, stay}. In each cell of the grid-world
environments, the agent can only observe the openness of the cell in the four directions.
The agent then has a total of 16 possible observations indicating the $2^4 = 16$ different
combinations of the openness of a cell in the four orientations. The actions (except the
action stay) taken by the agent are not accurate and have some noise. The probability of
arriving at the correct cell by taking a move action is 0.7 and the probability of arriving at
other neighboring cells is 0.3. The perception is noisy, with a probability 0.8 of correctly
observing the openness and a probability 0.2 of making a mistaken observation. The
agent receives a unit reward when the goal (indicated by a basket of food in the figures)
is reached and zero reward otherwise. The agent does not know the model of any of the
environments but only has access to the episodes, i.e., sequences of actions, observations,
and rewards.
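
The noise model can be simulated along the following lines. This is a sketch under stated assumptions, not the simulator used in the experiments: it assumes that a mistaken observation is drawn uniformly from the 15 other openness patterns and that a failed move lands uniformly on one of the other neighboring cells, neither of which is specified in the text. The names `noisy_observation` and `noisy_move` are hypothetical.

```python
import random


def noisy_observation(true_obs, rng, p_correct=0.8):
    """true_obs: tuple of four 0/1 flags (openness in the four directions).

    Assumption: a mistaken observation is uniform over the 15 other patterns.
    """
    if rng.random() < p_correct:
        return true_obs
    all_obs = [tuple((i >> b) & 1 for b in range(4)) for i in range(16)]
    return rng.choice([o for o in all_obs if o != true_obs])


def noisy_move(intended, other_neighbors, rng, p_correct=0.7):
    """Return the intended cell with probability p_correct; otherwise
    (assumption) a uniformly chosen cell among the other neighbors."""
    if not other_neighbors or rng.random() < p_correct:
        return intended
    return rng.choice(other_neighbors)


rng = random.Random(1)
obs = noisy_observation((1, 0, 1, 0), rng)
cell = noisy_move((2, 3), [(1, 3), (3, 3)], rng)
```

Both functions take the random generator explicitly so that a simulation can be seeded and repeated.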

Algorithm Learning and Evaluation  For each environment m = 1, 2, ..., 10, there is
a set of K episodes $\mathcal{D}_m^{(K)}$, collected by simulating the agent-environment interaction using
the models described above and a behavior policy $\Pi$ that the agent follows to determine
his actions. The behavior policy is the semi-random policy described in Section 6.1.1 with a
probability $p_{\text{query}}$ = 0.5 of taking the actions suggested by PBVI.

Reinforcement learning (RL) based on the ten sets of episodes leads to ten
RPRs, each associated with one of the ten environments. We consider three paradigms of
learning: the MTRL in which the Gibbs-variational algorithm in Table □ is applied to the
ten sets of episodes jointly, the STRL in which the MV algorithm in Table □ is applied to
each of the ten episode sets separately, and pooling in which the MV algorithm is applied
to the union of the ten episode sets. The number of decision states is chosen as $|\mathcal{Z}|$ = 6
for all environments and all learning paradigms. Other larger $|\mathcal{Z}|$ give similar results and,
if desired, the selection of decision states can be accomplished by maximizing the marginal
empirical value with respect to $|\mathcal{Z}|$, as discussed above.

Figure 2: The three distinct grid-world environments, where the goal is designated by the basket
of food, each block indicates a cell in the grid world, and the two gray cells are occupied
by a wall. The red dashed lines in (a) and (c) indicate the similar parts in the two
environments. The agent locates himself by observing the openness of a cell in the four
orientations. Both the motion and the observation are noisy.

The  RPR  policy  learned  by  any  paradigm  for  any  environment  is  evaluated  by  executing 
the  policy  1000  times  independently,  each  time  starting  randomly  from  a  grid  cell  in  the 
environment  and  taking  a  maximum  of  15  steps.  The  performance  of  the  policy  is  evaluated 
by  two  performance  measures:  (a)  the  average  success  rate  at  which  the  agent  reaches  the 
goal  within  15  steps,  and  (b)  the  average  number  of  steps  that  the  agent  takes  to  reach  the 
goal.  When  the  agent  does  not  reach  the  goal  within  15  steps,  the  number  of  steps  is  15. 
Each performance measure is computed from the 1000 instances of policy execution, and is
averaged over 20 independent trials.

Figure 3: Comparison of MTRL, STRL, and pooling on the problem of multiple stochastic
environments summarized in Figure 2. (a) Average success rate for the agent to reach the
goal within 15 steps. (b) Average number of steps for the agent to reach the goal. (c) Average
success rate for the agent, with the horizontal axis in log scale. (d) Average number of steps,
with the horizontal axis in log scale.


We examine the performance of each learning paradigm for various choices of K, the
number of episodes per environment. Specifically we consider 16 different choices: K =
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 24, 60, 120, 240. The performances of the three learning paradigms,
averaged over 20 independent trials, are plotted in Figure 3 as a function of K. Figures
3(c) and 3(d) are respectively duplicates of Figures 3(a) and 3(b), with the horizontal
axis displayed in a logarithmic scale. By (E3), the choice of the precision parameter $\alpha$ in the
Dirichlet process influences the probability of sampling a new cluster; it hence influences the
resulting number of distinct RPR parameters $\Theta$. According to (West, 1992), the choice of $\alpha$
is governed by the posterior $p(\alpha|K, N) \propto p(N|K, \alpha)\,p(\alpha)$, where N is the number of clusters
updated in the most recent iteration of the Gibbs-variational algorithm. One may choose $\alpha$
by sampling from the posterior or finding the mean. When K is large, $N \ll K$, and the
prior $p(\alpha)$ is a Gamma distribution, the posterior $p(\alpha|K, N)$ is approximately a Gamma
distribution with the mean $E(\alpha) = O(N \log(K))$. For the different choices of K considered
above, we choose $\alpha = 3n$, with n = 2, 3, ..., 15 respectively. These choices are based on
approximations of $E(\alpha)$ obtained by fixing N at an initial guess N = 8. We find that the
results are relatively robust to the initial guess provided the logarithmic dependence on K is
employed. The density of the DP base $G_0$ is of the form in (E2I), with all hyper-parameters
set to one, making the base non-informative.
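
The logarithmic dependence of $\alpha$ on K can be tabulated directly. The sketch below evaluates $N \log(K)$ with N fixed at the initial guess of 8; the use of the natural logarithm and a unit constant factor are assumptions (the text only gives the order of magnitude), so the numbers illustrate the growth rate rather than reproduce the values of $\alpha$ actually used.

```python
import math


def alpha_guess(num_clusters, num_episodes):
    """Order-of-magnitude value N * log(K) for the DP precision parameter.

    Assumptions: natural logarithm and a constant factor of one.
    """
    return num_clusters * math.log(num_episodes)


# The values of K listed in the text.
K_VALUES = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 24, 60, 120, 240]
alphas = [alpha_guess(8, k) for k in K_VALUES]
```

The resulting sequence increases slowly with K, which is the behavior the robustness remark above relies on.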

Figures 3(a) and 3(b) show that the performance of MTRL is generally much better
than that of STRL and pooling. The improvement is attributed to the fact that MTRL
automatically identifies and enforces appropriate sharing among the ten environments to
ensure that information transfer is positive. The improvement over STRL indicates that
the number of episodes required for finding the correct sharing is generally smaller than that
required for finding the correct policies.

The identification of appropriate sharing is based on information from the episodes.
When the number of episodes is very small (say, fewer than 25 in the examples here), the
sharing found by MTRL may not be accurate; in this case, simply pooling the episodes
across all ten environments may be a more reasonable alternative. When the number of
episodes increases, however, pooling begins to show disadvantages since the environments
are not all the same and forcing them to share leads to negative information transfer. The
seemingly degraded performance of pooling at the first two points in Figure 3(c) may not
be accurate since the results have large variations when the episodes are extremely scarce;
many more Monte Carlo runs may be required to obtain accurate results in these cases.

The performance of STRL is poor when the number of episodes is small, because a
small set of episodes does not provide enough information for learning a good RPR. However,
the STRL performance improves significantly as the number of episodes increases, which whittles
down the advantage brought about by information transfer and allows STRL to eventually
catch up with MTRL in performance.


Analysis of the Sharing Mechanism  We investigate the sharing mechanism of the
MTRL by plotting Hinton diagrams. The Hinton diagram (Hinton and Sejnowski, 1986) is
a quantitative way of displaying the elements of a data matrix. Each element is represented
by a square whose size is proportional to the magnitude. In our case here, the data matrix
is the between-task similarity matrix learned by the MTRL; it is defined as
follows: the between-task similarity matrix is a symmetric matrix of size M x M (where M
denotes the number of tasks and M = 10 in the present experiment), with the (i, j)-th element
measuring the frequency with which task i and task j belong to the same cluster (i.e., they result
in the same distinct RPR). In order to avoid the bias due to any specific set of episodes,
we perform 20 independent trials and average the similarity matrix over the 20 trials. In
each trial, if tasks i and j belong to one cluster upon convergence of the Gibbs-variational
algorithm, we add one at (i, j) and (j, i) of the matrix. We compute the between-task
similarity matrices when the number of episodes is respectively K = 3, 10, 60, 120, which
represent the typical sharing patterns inferred by the MTRL for the present maze navigation
problem. The Hinton diagrams for these four matrices are plotted in Figure 4.
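
The construction of the between-task similarity matrix can be sketched as follows. The function name `similarity_matrix` and the input format (one list of cluster labels per trial, taken after convergence) are assumptions made for illustration.

```python
def similarity_matrix(assignments_per_trial, num_tasks):
    """Average co-clustering frequency over independent trials.

    assignments_per_trial: one list of cluster labels per trial, where
    labels[i] is the cluster of task i upon convergence in that trial.
    """
    sim = [[0.0] * num_tasks for _ in range(num_tasks)]
    for labels in assignments_per_trial:
        for i in range(num_tasks):
            for j in range(num_tasks):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0      # visits both (i, j) and (j, i)
    n_trials = len(assignments_per_trial)
    return [[v / n_trials for v in row] for row in sim]


# Two hypothetical trials over three tasks.
sim = similarity_matrix([[1, 1, 2], [1, 2, 2]], num_tasks=3)
```

Each entry lies in [0, 1], the diagonal equals one, and the matrix is symmetric by construction; a Hinton diagram then draws each entry as a square of proportional size.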

The Hinton diagrams in Figures 4(a) and 4(b) show that when the number of episodes
is small, environments 1, 2, 3, 7, 8, 9, 10 have a higher frequency of sharing the same
RPR. This sharing can be intuitively justified by first recalling that these environments are
duplicates of Figures 2(a) and 2(c), and then noting that the parts circumscribed by red
dashed lines in Figures 2(a) and 2(c) are quite similar. Meanwhile the Hinton diagrams also
show a weak sharing between environments 4, 5, 6, 7, 8, 9, 10, which are duplicates of Figures
2(b) and 2(c). This is probably because the episodes are very few at this stage, and pooling
episodes from environments that are not so relevant to each other could also be helpful.
This explains why, in Figure 3(a), the performance of pooling is as good as that of the
MTRL when the number of episodes is small.

Figure 4: Hinton diagrams of the between-task similarity matrix inferred by the MTRL for the
problem of multiple stochastic environments summarized in Figure 2. The number of episodes
per environment is (a) 3 (b) 10 (c) 60 (d) 120.


As the number of episodes progressively increases, the ability of MTRL to identify
the correct sharing improves and, as seen in Figures 4(b) and 4(c), only those episodes
from relevant environments are pooled together to enhance the performance, whereas a simple
pooling of all episodes together deteriorates the performance. This explains why the MTRL
outperforms pooling with the increase of episodes. Meanwhile, the STRL does not perform
well for limited episodes. However, when there are more episodes from each environment,
the STRL learns and performs steadily better until it outperforms the pooling and becomes
comparable to the MTRL.

Figure 5: Displacements of goal state in the six environments considered in Maze Navigation 2.
Each environment is a variant of the benchmark Hallway2 (Littman et al., 1995) with
the goal displaced to a new grid cell designated by a numbered circle and the number
indicating the index of the environment. The unique observation associated with the goal
is also changed accordingly in each variant.

Figure 6: Performance comparison on the six environments modified from the benchmark problem
Hallway2 (Littman et al., 1995). (a) Discounted accumulative reward averaged over the
six environments (b) Discounted accumulative reward in the first environment, which is
the original Hallway2.


6.2.2  Maze  Navigation  2 

We consider six environments, each of which results from modifying the benchmark maze
problem Hallway2 (Littman et al., 1995) in the following manner. First the goal state is
displaced to a new grid cell and then the unique observation associated with the goal is
changed accordingly. For each environment the location of the goal state is shown in Figure
5 as a numbered circle, where the number indicates the index of the environment. Of the
six environments the first one is the original Hallway2. It is seen that environments 1, 2, 3
have their goal states near the lower right corner while environments 4, 5, 6 have their goal
states near the upper left corner. Thus we expect that the environments are grouped into
two clusters.

For each environment, a set of K episodes is collected by following a semi-random
behavior policy $\Pi$ that executes the actions suggested by PBVI with probability $p_{\text{query}}$ =
0.3. As in Section 6.2.1, three versions of the RPR are obtained for each environment, based
respectively on three paradigms, namely MTRL, STRL, and pooling. The $\alpha$ is chosen as
5  log(iv )  with  5  corresponding  to  an  initial  guess  of  N  and  Go  is  of  the  form  of  (E2)  with  all 
hyper-parameters  close  to  one  (thus  the  prior  is  non-informative) .  The  number  of  decision 
states  is  \Z\  =  20  as  in  Section  10. 1 .  ll.  The  performance  comparison,  in  terms  of  discounted 
accumulative reward and averaged over 20 independent trials, is summarized in Figure 6
as a function of the number of episodes per environment.
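
The semi-random behavior policy described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the agent acts when it does not query PBVI, so the sketch assumes a uniformly random action in that case, and the action labels and the helper names `semi_random_action` and `expert_fraction` are hypothetical.

```python
import random


def semi_random_action(expert_action, actions, p_query=0.3, rng=random):
    """Pick the expert's suggestion with probability p_query, else a random action.

    Assumption: the non-expert branch is uniform over `actions`.
    """
    if rng.random() < p_query:
        return expert_action
    return rng.choice(actions)


def expert_fraction(n_steps=20000, p_query=0.3, seed=0):
    """Empirical fraction of steps on which the executed action equals the expert's."""
    rng = random.Random(seed)
    actions = ["a0", "a1", "a2", "a3", "a4"]  # hypothetical action labels
    expert_action = "a0"
    hits = sum(
        semi_random_action(expert_action, actions, p_query, rng) == expert_action
        for _ in range(n_steps)
    )
    return hits / n_steps
```

Under the uniform assumption with five actions, the executed action agrees with the expert with probability 0.3 + 0.7/5 = 0.44, so most steps are exploratory rather than expert-driven.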

Figure 6(a) shows that the MTRL maintains the best overall performance regardless of
the number of episodes K. The STRL and the pooling are sensitive to K, with the pooling
outperforming the STRL when K < 540 but outperformed by the STRL when K > 540.
In either case, however, the MTRL performs no worse than either. The MTRL consistently
performs well because it adaptively adjusts the sharing among tasks as K changes, such
that the sharing is appropriate regardless of K. The adaptive sharing can be seen from
Figure 7, which shows the Hinton diagram of the between-task similarity matrix learned by
the MTRL for various values of K. When K is small there is strong sharing among
all tasks, in which case the MTRL reduces to the pooling, explaining why the MTRL
performs similarly to the pooling when K < 250. When K is large, the sharing becomes
weak between any two tasks, which reduces the MTRL to the STRL, explaining why the
two perform similarly when K > 700. As the number of episodes approaches K = 540,
the performances of the STRL and the pooling become closer
until they eventually meet at K = 540. The range of K near this intersection is also
where the MTRL yields the largest margin of improvement over the STRL
and the pooling. This is because, for this range of K, the correct between-task sharing
is complicated (as shown in Figure 7(b)); it can be accurately characterized by the fine
sharing patterns provided by the MTRL, but not by the pooling or the
STRL.
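
As a rough illustration of what a between-task similarity matrix can look like, the sketch below estimates it as the fraction of sampled cluster assignments in which two tasks fall in the same cluster. This co-clustering definition is an assumption made for illustration; this section does not state how the matrix is computed, and the function name and the toy samples are hypothetical.

```python
def similarity_matrix(assignment_samples):
    """Fraction of samples in which tasks i and j share a cluster label.

    assignment_samples: list of lists, one cluster label per task per sample.
    """
    n_tasks = len(assignment_samples[0])
    counts = [[0] * n_tasks for _ in range(n_tasks)]
    for sample in assignment_samples:
        for i in range(n_tasks):
            for j in range(n_tasks):
                if sample[i] == sample[j]:
                    counts[i][j] += 1
    n_samples = len(assignment_samples)
    return [[c / n_samples for c in row] for row in counts]


# Toy samples for six tasks: mostly two clusters {1,2,3} and {4,5,6},
# with task 3 occasionally splitting off into its own cluster.
samples = [[0, 0, 0, 1, 1, 1]] * 3 + [[0, 0, 1, 2, 2, 2]]
sim = similarity_matrix(samples)
```

A Hinton diagram would draw each entry of `sim` as a square whose area is proportional to the entry, so strong sharing among all tasks shows up as a full matrix of large squares and weak sharing as a dominant diagonal.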

Figure 6(a) plots the overall performance comparison taking all environments into
consideration. As an example of the performance in individual environments, we show in
Figure 6(b) the performance comparison in the first environment, which is also the original
Hallway2 problem. The change of magnitude in the vertical axis is due to the fact that the first
environment has the goal in a room (instead of in the hallway), which makes it more difficult
to reach the goal.

6.2.3 Multi-aspect Classification

Problem  Description  Multi-aspect  classification  refers  to  the  problem  of  identifying 
the  class  label  of  an  object  using  observations  from  a  sequence  of  viewing  angles.  This 
Figure 7: Hinton diagrams of the between-task similarity matrix learned by the MTRL from the
six environments modified from the benchmark problem Hallway2 (Littman et al., 1995).
The number of episodes is (a) K = 40 (b) K = 540 (c) K = 810.


Figure 8: A typical configuration of multi-aspect classification of underwater objects.


problem is generally found in applications where the object responds to interrogations in an
angle-dependent manner. In such cases, an observation at a single viewing angle carries
information specific to only that angle and the nearby angles, and one requires observations
at many viewing angles to fully characterize the object.

More importantly, the observations at different viewing angles are not independent of
each other, and are correlated in a complicated yet useful way. The specific form of the
angle-dependency is dictated by the physical constitution of the object as well as the nature
of the interrogator, typically electromagnetic or acoustic waves. By carefully collecting
and processing observations sampled at densely spaced angles, it is possible to form an
image, based on which classification can be performed. An alternative approach is to treat
the observations as a sequence and characterize the angle-dependency by a hidden Markov
model (HMM) (Runkle et al., 1999).

In this section we consider multi-aspect classification of underwater objects based on
acoustic responses of the objects. Figure 8 shows a typical configuration of the problem.
The cylinder represents an underwater object of unknown identity y. We assume that the
object belongs to a finite set of categories $\mathcal{Y}$ (i.e., $y \in \mathcal{Y}$). The agent aims to discover the
Figure 9: Frequency-domain acoustic responses of the five underwater objects (a) Target-1 (b)
Target-2 (c) Target-3 (d) Target-4 (e) Clutter.


unknown y by moving around the object and interrogating it at multiple viewing angles
$\varphi$. We assume the angular motion is one-dimensional, i.e., the agent moves clockwise or
counterclockwise on the page, but does not move out of the page. The set of angles that
can be occupied by the agent is then [0°, 360°], which in practice is discretized into a finite
number of angular sectors denoted by $\mathcal{S}_\varphi$.

In the HMM approach (Runkle et al., 1999), $\mathcal{S}_\varphi$ constitutes the set of hidden states,
and the state transitions can be computed using simple geometry (Runkle et al., 1999),
under the assumptions that the agent moves by a constant angular step each time and
that the specific angles occupied by the agent are uniformly distributed within any given
state. Refinement of state transitions and estimation of state emissions can be achieved by
maximizing the likelihood function constructed from the training sequences. In the training
phase, one trains an HMM for each $y \in \mathcal{Y}$. For an unknown object, one collects a sequence
of observations (sensor data) and submits it to the HMM for each $y \in \mathcal{Y}$; the y yielding the
maximum likelihood is then declared to be the identity of the unknown object. Obviously
the agent must follow a common protocol to collect the data sequences in both the training
and test phases, to ensure that their statistics are consistent. Since such a protocol is not
part of the HMMs, a question arises as to how to specify the protocol.
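
The maximum-likelihood decision rule of the HMM approach can be sketched as below. This is a generic scaled forward algorithm over discrete observations, not the specific models of Runkle et al.; the two-state toy HMMs, their parameters, and the function names are hypothetical and chosen only to make the example runnable.

```python
import math


def log_likelihood(obs, init, trans, emit):
    """Scaled forward algorithm: returns ln p(obs | HMM) for discrete symbols."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return loglik


def classify(obs, hmms):
    """Declare the label whose HMM assigns the sequence the highest likelihood."""
    return max(hmms, key=lambda label: log_likelihood(obs, *hmms[label]))


# Hypothetical class-conditional HMMs: (initial, transition, emission).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
hmms = {
    "target": (init, trans, [[0.8, 0.2], [0.7, 0.3]]),
    "clutter": (init, trans, [[0.2, 0.8], [0.3, 0.7]]),
}
```

Note that nothing in `classify` decides how the sequence `obs` is gathered, which is exactly the missing data-collection protocol discussed above.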

From the perspective of sequential decision-making, multi-aspect classification can be
formulated as a reinforcement learning problem, with a state space $\mathcal{S} = \mathcal{S}_\varphi \times \mathcal{Y}$, where $\times$ is the
Cartesian product. Both $\mathcal{S}_\varphi$ and $\mathcal{Y}$ are only partially observable (through sensor data). The
RL approach possesses several conspicuous advantages over the HMM approach. First, the
sensor data are now collected in an active manner, under the control of agent actions. When
two data sequences are collected by following the same policy of choosing actions, they are
automatically ensured to be consistent in statistics, hence there is no need to specify a
separate common protocol for collecting the sequential data. Second, unlike maximizing
the data likelihood (under a given data collection protocol), the agent is now free to choose
a more flexible learning objective by setting an appropriate reward structure. Third, unlike
building an HMM for each $y \in \mathcal{Y}$, the different categories are now coalesced into a single
RPR (details are presented below), making the RL a discriminative approach vis-à-vis the
generative HMM approach.

In our experiment, there are a total of five objects: four of them are targets of interest
and one represents the clutter. The frequency-domain acoustic responses of these
objects are shown in Figure 9, for a full coverage of angles from 0° to 360°; the data are real
measurements as described in (Runkle et al., 1999). We aim to distinguish each target from
clutter, which gives four tasks, where task i is defined by the problem of distinguishing
target-i from clutter, i = 1, 2, 3, 4, and the targets and clutter are as shown in Figure 9.
Each task is a multi-aspect classification problem.⁶ From the data in Figure 9, targets 1
and 2 have similar angle-dependent scattering phenomena, and therefore Tasks 1 and 2
are expected to be related. Targets 3 and 4 also appear to have similar angle-dependent
scattering characteristics, and therefore Tasks 3 and 4 are also expected to be related. In
fact, although the target details are too involved to describe here, targets 1 and 2 are both
of a cylindrical form (like those in (Runkle et al., 1999)), while targets 3 and 4 are more
irregular in shape.

The RPR for Multi-aspect Classification In applying the RPR to multi-aspect
classification, our approach is distinct from an HMM construction (Runkle et al., 1999) in two
important respects. First, the RPR is a control model and it aims to optimize the value
function, instead of the likelihood function. Since the RPR takes into account a reward
structure, it can be more flexible in specifying the learning objective. Second, the RPR
embraces all objects in the same representation, instead of having a separate model for each
individual object. As a result, it is a discriminative model instead of a generative model
(this may be viewed as a discriminative extension of the traditional generative HMM).


6.  Upon  publication,  all  data  from  this  study  will  be  put  on  a  web  site,  for  others  to  utilize  in  comparative 
studies. 
The RPR does not manipulate the angular states; it works directly with observations.
Since classification is treated as a control problem in the RPR, we need two extra
components, actions and rewards, to complete the specification. We consider four actions, i.e.,
A = {declare as target, declare as clutter, move clockwise and sense, move counterclockwise
and sense}. When the agent takes action move clockwise and sense, it moves 5° clockwise
and collects an observation; when the agent takes action move counterclockwise and sense,
it moves 5° counterclockwise and collects an observation. The reward structure is specified
as follows. A correct declaration receives a reward of 5 units, a false declaration receives
a reward of −5 units, and the actions move clockwise and sense and move counterclockwise and
sense each receive a reward of zero units. The objective, therefore, is to correctly classify
the target with the minimal number of sensing actions.
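
The reward structure above can be made concrete with a short sketch. The discount factor value 0.95 is an assumption for illustration (this section does not state the value used), and the constant and function names are hypothetical. With a discount factor below one, a correct declaration made after fewer sensing actions earns a larger discounted return, which is one way to read the stated objective.

```python
DECLARE_TARGET, DECLARE_CLUTTER, MOVE_CW, MOVE_CCW = range(4)
DECLARATIONS = (DECLARE_TARGET, DECLARE_CLUTTER)


def reward(action, true_label):
    """+5 for a correct declaration, -5 for a false one, 0 for sensing moves."""
    if action not in DECLARATIONS:
        return 0.0
    declared = "target" if action == DECLARE_TARGET else "clutter"
    return 5.0 if declared == true_label else -5.0


def discounted_return(actions, true_label, gamma=0.95):
    """Discounted sum of rewards; the episode ends at the first declaration.

    gamma = 0.95 is an assumed value, not taken from the text.
    """
    total = 0.0
    for t, action in enumerate(actions):
        total += (gamma ** t) * reward(action, true_label)
        if action in DECLARATIONS:
            break
    return total


short = discounted_return([MOVE_CW, MOVE_CW, DECLARE_TARGET], "target")
long_ = discounted_return([MOVE_CW] * 6 + [DECLARE_TARGET], "target")
wrong = discounted_return([DECLARE_CLUTTER], "target")
```

Here `short` exceeds `long_` because the same +5 reward is discounted over fewer steps, while `wrong` is negative.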

The episodes used in learning the RPR consist of a number of observation sequences; each
observation is associated with the action move clockwise and sense or move counterclockwise
and sense, and the terminal action in each episode is the correct declaration. The correct
declaration is available because the episodes in this problem are the training data in standard
classification, hence the ground truth of class labels is known. Note that the training
episodes always terminate with a correct declaration, thus the agent never actually receives
the penalty −5 during the training phase. Alternatively, one may split each episode into
two, terminated respectively with the correct and the false declaration. Recall that the false
declaration receives the minimum reward which, after an offset of 5 to make all rewards
non-negative, is converted to zero. Since a zero reward received at the end of an episode
nullifies the entire episode, such an alternative is equivalent to excluding the penalized
episodes.


Classification Results The raw data are shown in Figure 9 for the five objects we are
considering. Each datum is the response of an object measured at a particular angle, and
the data set for an object consists of measurements collected at 0°, 1°, ..., 359°. Each raw
datum is converted into a feature vector using matching pursuit (McClure and Carin, 1997),
and the feature vectors are further discretized by vector quantization (Gersho and Gray,
1992) to produce a finite code-book. As mentioned earlier, we have a total of four tasks,
each task being to distinguish one of the four targets from the clutter.
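
The discretization step can be illustrated with a minimal nearest-codeword quantizer. The codebook below is made up for the example, and the squared Euclidean distance is an assumption; how the actual code-book was trained from the matching-pursuit features is not described in this section.

```python
def quantize(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared Euclidean)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(range(len(codebook)), key=lambda k: sqdist(vector, codebook[k]))


# Hypothetical 2-D codebook with three codewords.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.9, 1.2], [3.5, 0.1], [0.1, -0.2]]
symbols = [quantize(f, codebook) for f in features]
```

Each feature vector is thereby replaced by a discrete symbol, which is the form of observation that both the HMM and the RPR consume.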

Four methods are compared: the MTRL, the STRL, the pooling, and the hidden Markov
model (HMM), where the first three are as described in Section 6.2.1 and the last one is the
standard hidden Markov model (Rabiner, 1989). The four methods yield four corresponding
agents, each following the policy resulting from one of the algorithms.

When the agents collect episodes during the training phase, they start from angles that
are uniformly drawn from {1°, 2°, ..., 360°}. For each starting angle, two episodes are
collected: the first is obtained by moving clockwise to collect an observation at every 5°
and terminating upon the 10-th observation, and the other is the same as the first except that the
agent moves counterclockwise. During the testing phase, both the RPR agents and the
HMM agent start from angles uniformly drawn from {1°, 2°, ..., 360°}; however, the RPR
agents follow one of the three policies (resulting respectively from the MTRL, the STRL,
and the pooling) to choose an action from A, while the HMM agent collects n observations
by moving consistently clockwise or counterclockwise (either direction is chosen with a
probability of 0.5) and then makes a declaration, where n is adaptively set to the maximum
of the numbers of observations used by the three RPR agents starting from the same angle.
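
The training-episode collection protocol can be sketched as follows. Two conventions are assumed here because the text leaves them open: the first observation is taken after the first 5° move, and clockwise motion is represented as increasing angle. The function names are hypothetical.

```python
import random


def wrap(angle_deg):
    """Map any integer angle onto the set {1, 2, ..., 360}."""
    return (angle_deg - 1) % 360 + 1


def training_episode_angles(start_deg, step_deg=5, n_obs=10):
    """Sensing angles for the clockwise and counterclockwise episodes.

    Assumptions: the first observation follows the first move, and clockwise
    is taken as increasing angle.
    """
    clockwise = [wrap(start_deg + step_deg * i) for i in range(1, n_obs + 1)]
    counterclockwise = [wrap(start_deg - step_deg * i) for i in range(1, n_obs + 1)]
    return clockwise, counterclockwise


def sample_training_angles(rng):
    """Draw a start angle uniformly from {1, ..., 360} and build both episodes."""
    start = rng.randint(1, 360)
    return start, training_episode_angles(start)


cw, ccw = training_episode_angles(350)
```

Each starting angle thus contributes two sequences of ten sensing positions, to which the correct declaration is appended as the terminal action.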

Figure 10 summarizes the performance as a function of the number of training episodes
K, where the performance is evaluated by the correct classification rate as well as the
average number of sensing actions (i.e., the average number of observations collected) before
a declaration is made. Each point in the figures is an average from 20 independent trials.


Figure 10: Performance comparison on multi-aspect classification of underwater targets. (a)
Average classification rate as a function of the number of training episodes per task. (b)
Average number of sensing actions (average number of observations collected) before a
declaration is made, as a function of the number of training episodes per task.


It is seen from Figure 10(a) that the MTRL achieves the highest classification rate
regardless of the number of training episodes K. The pooling performs worse than the STRL,
and the poor performance persists even when K is small. The latter is in contrast with
the results on the maze navigation problems in Sections 6.2.1 and 6.2.2, where the pooling
performs better than the STRL with a small K. The reason for this will be clear below
from the sharing-mechanism analysis. It is noted that all three RPR algorithms perform
much better than the HMM, demonstrating the advantage of the discriminative model over
the generative model in these classification problems.

As shown by Figure 10(b), the pooling takes the fewest sensing actions, which may
be attributed to the over-confidence arising from an abundant set of training data, noting
that the pooling agent learns its policy by using the episodes accumulated over all tasks.
In contrast, the STRL agent takes the most sensing actions. Considering that the STRL
agent bases policy learning on the episodes collected from a single task, which may contain
inadequate information, it is reasonable that the STRL agent is less confident and would
make more observations before coming to a conclusion. The number of sensing steps taken by the
MTRL agent lies in between, since it relies on related tasks, but not all tasks, to provide
the episodes for policy learning.

Analysis of the Sharing Mechanism The Hinton diagram of the between-task
similarity matrix is shown in Figures 11(a), 11(b), 11(c), and 11(d), for the cases when the number
of training episodes K is equal to 10, 30, 110, and 170, respectively.

Figure 11: Sharing mechanism for multi-aspect classification of underwater targets. Each panel
is the Hinton diagram of the between-task similarity matrix (both axes: index of tasks), with
the number of training episodes per task: (a) 10 (b) 30 (c) 110 (d) 170.


It is seen that the sharing patterns are dominated by two clusters, the first consisting
of Task 1 and Task 2 and the second consisting of Task 3 and Task 4. The second cluster
remains unchanged regardless of K. The first cluster tends to break when K = 30, but is
restored later on. The two clusters are consistent with Figure 9, which shows that targets 1
and 2 are similar and so are targets 3 and 4. The fact that the two clusters persist through
the entire range of K implies that tasks from different clusters are weakly related even
when the episodes are scarce; as a result, pooling the episodes across all tasks yields poor
policies. This explains the poor performance of the pooling in Figure 10(a).

To understand why the cluster of Tasks 1 and 2 is less stable, one needs to delve
into some details of the targets. Target 1 and Target 2 both have a cylindrical shape while
Target 3 and Target 4 are more irregular in shape. Similar geometry puts Targets 1 and 2 in one
cluster and Targets 3 and 4 in another cluster. Moreover, the measurements of Targets 3
and 4 are noisier than the measurements of Targets 1 and 2, because they are collected
under different conditions. The low signal-to-noise ratio (SNR) increases the similarity
between Targets 3 and 4 since their distinctive features are buried in the noise. The more
noise-free measurements of Targets 1 and 2 yield a more faithful representation of these
targets, which tends to magnify their differences and make them appear less similar.

7.  Conclusions 

We  have  presented  a  multi-task  reinforcement  learning  (MTRL)  framework  for  partially 
observable  stochastic  environments.  To  our  knowledge,  this  is  the  first  framework  proposed 
for  MTRL  in  the  partially  observable  domain. 

A key element in our MTRL framework is the regionalized policy representation (RPR),
which yields a history-dependent stochastic policy for environments characterized by a
partially observable Markov decision process (POMDP). Learning of the RPR is based on
episodic experiences collected from the environment, without requiring the environment's
model. We have developed two algorithms for learning the RPR, one based on maximum-value
estimation and the other based on the variational Bayesian paradigm. The latter
offers the ability to select the number of decision states based on the Occam's razor
principle and the possibility of transferring experience between related environments.

Built  upon  the  basic  RPR,  the  proposed  MTRL  framework  consists  of  multiple  RPRs, 
each  for  an  environment,  coupled  by  a  common  Dirichlet  process  (DP)  that  is  used  to 
produce the nonparametric prior over all RPRs. By virtue of the discreteness of the
nonparametric prior, the environments are clustered into groups, with each group consisting
of  a  subset  of  environments  that  are  related  in  some  manner.  The  number  of  groups  as 
well  as  the  associated  environments  are  automatically  identified,  and  the  experiences  are 
shared  among  the  related  environments  to  increase  their  respective  exploration.  A  hybrid 
Gibbs-variational  algorithm  is  presented  for  learning  multiple  RPRs  simultaneously  under 
the  unified  MTRL  framework,  based  on  selective  use  of  the  experiences  collected  across  all 
environments. 

Experimental results demonstrate that the proposed MTRL consistently yields superior
performance regardless of the amount of experience used in learning. The two competitors,
one based on single-task reinforcement learning (STRL) and the other based on simple
pooling, are shown to be sensitive to the amount of experience. The superior performance
is attributed to the ability of the MTRL to automatically identify useful experiences from
related environments to enhance the exploration. The MTRL adaptively adjusts sharing
patterns to offset the changes in the experience and hence has addressed the problem of
how to positively transfer the experience from one environment to improve
learning in another. In addition, we have also presented experimental results on benchmark
problems demonstrating the RPR as a powerful stand-alone algorithm for single-task
reinforcement learning.

The work presented in this report mainly focuses on off-policy batch learning, assuming
the learning is based on a fixed set of episodic experiences collected by following an external
behavior policy. In the off-policy batch learning mode, the policy improvement is
implemented without actually re-interacting with the environment; instead the improvement is
implemented through virtual "reward re-computation" (discussed after (EE)), which
simulates the re-interaction with the environment. By taking reward re-computation out of the
algorithm and implementing it via real re-interaction, we can learn the RPRs in an on-policy
online manner. In this case, the need for an external behavior policy is eliminated and the
previous version of the RPR is employed as the behavior policy. In the next phase of this work, we
will focus on on-policy online learning of RPRs and investigate how each environment can
be better explored via multi-task reinforcement learning. In this on-policy MTRL setting,
multi-task learning will have two aspects: co-exploitation (already addressed in the present
report) and co-exploration (not explicitly addressed here). It is of interest to investigate
how much benefit can be gained by simultaneous co-exploitation and co-exploration.

Although the experiments considered in the report mainly involve robot navigation in
grid-worlds, there are many other interesting practical problems to which the proposed
algorithms are immediately applicable. The multi-aspect classification serves as a preliminary
example of such applications. Other examples include using RPRs as policies to control and
coordinate a set of sub-models such that the collective performance is optimized and more
advanced tasks can be accomplished than by any single sub-model.

For the work presented here, the DP prior is placed directly on $\Theta$. Because of the
discrete nature of G, this implies that when parameters $\Theta$ are shared between different
environments, they are shared exactly. This may be too restrictive for some problems; for
two environments that are similar, we may desire the associated parameters to be similar,
but not exactly the same. This may be accommodated, for example, via the following
modification to the DP prior


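$$
\begin{aligned}
\Theta_m \mid \psi_m &\sim H(\psi_m) \\
\psi_m \mid G &\sim G \\
G \mid \alpha, G_0 &\sim \mathrm{DP}(\alpha, G_0)
\end{aligned}
$$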

This formulation results in an infinite mixture model for $\Theta$, where each component is of
the form H. When two environments share, their parameters share a component of this
infinite mixture, but the specific draws will generally differ from each other; this can
provide greater flexibility. The above modification brings some challenges to the inference.
Recalling that $\Theta$ is a set of probability mass functions (pmf), it is natural to require H to be a
product of Dirichlets. The difficulty now lies in choosing G that provides a conjugate prior
for the parameters of H, which seems not easy. If G is properly specified, however, the
inference should be a straightforward extension of the techniques developed in this report.
An alternative to the above modification that may avoid the inference difficulty is to follow
the approach in (Liu et al., 2008) to impose soft sharing by replacing the Dirac delta with
its soft version.
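
The difference between exact and soft sharing can be illustrated with a small generative sketch. It is not the inference procedure of this report: cluster membership is drawn with a Chinese restaurant process (a standard representation of DP clustering), each cluster receives a hypothetical Dirichlet concentration vector drawn from a made-up base measure, and each environment then draws its own pmf from the Dirichlet of its cluster. All names and numeric ranges are assumptions for illustration.

```python
import random


def crp_assignments(n_items, alpha, rng):
    """Sequentially assign items to clusters with a Chinese restaurant process."""
    assignments, counts = [], []
    for _ in range(n_items):
        weights = counts + [alpha]  # existing clusters, then a new cluster
        r = rng.random() * sum(weights)
        acc = 0.0
        k = len(weights) - 1
        for idx, w in enumerate(weights):
            acc += w
            if r < acc:
                k = idx
                break
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments


def sample_dirichlet(concentration, rng):
    """Draw a pmf from a Dirichlet via normalized Gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def soft_shared_pmfs(n_envs, alpha, n_outcomes, rng):
    """Environments in one cluster share a Dirichlet component, not an exact pmf."""
    assignments = crp_assignments(n_envs, alpha, rng)
    component = {}
    pmfs = []
    for k in assignments:
        if k not in component:
            # Hypothetical base measure over concentration parameters.
            component[k] = [rng.uniform(0.5, 5.0) for _ in range(n_outcomes)]
        pmfs.append(sample_dirichlet(component[k], rng))
    return assignments, pmfs


assignments, pmfs = soft_shared_pmfs(6, 1.0, 4, random.Random(0))
```

Under exact sharing, two environments with the same cluster label would hold identical pmfs; here they hold different draws from a common component, which is the extra flexibility discussed above.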


Appendix 

Proof  of  Theorem  0 

According to (Kaelbling et al., 1998), the expected sum of exponentially discounted reward
(value function) over an infinite horizon can be written as

$$V = E\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right] \qquad \text{(A-1)}$$



where $0 < \gamma < 1$ is the discount factor. Let $\mathcal{E}$ denote the environment in question and
the corresponding probabilistic model (POMDP). Let $\Theta$ be the parameters specifying the
RPR; the expectation in our situation here is $E_{\text{episodes} \mid \mathcal{E}, \Theta}$. Thus
where the sum over $0 \le t < \infty$ is equal to the sum over $0 \le t \le T_k$ because $r_t^k = 0$ for
$t > T_k$ according to Definition □. Q.E.D.

Proof  of  Theorem  0 

We  begin  our  derivation  by  writing  the  empirical  value  function  in  its  logarithm 
Applying Jensen's inequality to (A-4), we obtain
The lower bound is maximized when
which turns the inequality into an equality. Define
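By (A-6), $\mathrm{LB}(\widehat{\Theta} \mid \Theta) \le \mathrm{LB}(\widehat{\Theta} \mid \widehat{\Theta}) = \ln \widehat{V}(\mathcal{D}^{(K)}; \widehat{\Theta})$ holds for any $\Theta$ and $\widehat{\Theta}$. Therefore, when
$\widehat{\Theta} = \arg\max_{\widehat{\Theta} \in \mathcal{F}} \mathrm{LB}(\widehat{\Theta} \mid \Theta)$, we have

$$\ln \widehat{V}(\mathcal{D}^{(K)}; \Theta) = \mathrm{LB}(\Theta \mid \Theta) \le \mathrm{LB}(\widehat{\Theta} \mid \Theta) \le \mathrm{LB}(\widehat{\Theta} \mid \widehat{\Theta}) = \ln \widehat{V}(\mathcal{D}^{(K)}; \widehat{\Theta})$$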

Starting from $\Theta^{(0)}$ we compute
which satisfy $\widehat{V}(\mathcal{D}^{(K)}; \Theta^{(0)}) \le \widehat{V}(\mathcal{D}^{(K)}; \Theta^{(1)}) \le \widehat{V}(\mathcal{D}^{(K)}; \Theta^{(2)}) \le \cdots$. Since the value
function is upper bounded, this monotonically increasing sequence must converge, which
happens at a maximum of $\widehat{V}(\mathcal{D}^{(K)}; \Theta)$.
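
The role Jensen's inequality plays in this kind of argument can be checked numerically on a generic example. The sketch below is not the bound of this proof: it uses arbitrary positive weights $w_i$ as stand-ins for the per-term contributions to the value function and verifies that $\ln \sum_i w_i \ge \sum_i q_i \ln (w_i / q_i)$ for any pmf $q$, with equality when $q_i \propto w_i$. The function names and weights are hypothetical.

```python
import math


def log_sum(weights):
    """ln of the sum of positive weights (the quantity being bounded)."""
    return math.log(sum(weights))


def jensen_lower_bound(weights, q):
    """Sum_i q_i ln(w_i / q_i), a lower bound on ln Sum_i w_i for any pmf q."""
    return sum(qi * math.log(wi / qi) for wi, qi in zip(weights, q) if qi > 0)


def tight_q(weights):
    """The pmf proportional to the weights, which makes the bound an equality."""
    total = sum(weights)
    return [w / total for w in weights]


weights = [0.2, 1.5, 3.0]  # arbitrary positive stand-ins
uniform = [1.0 / 3.0] * 3
loose = jensen_lower_bound(weights, uniform)
tight = jensen_lower_bound(weights, tight_q(weights))
```

Maximizing the bound over the parameters after making it tight can only increase the bounded quantity, which is the monotonicity used in the proof above.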


Proof  of  Lemma  □ 

Substituting  (EH)  and  (E2I),  we  have 

Right  side  of  (E3)  = 
Since the denominator is equal to $p(a_{0:t} \mid o_{1:t}, \Theta)$ by (DU), we have

Appendix: Proof of Theorem El

We  rewrite  the  lower  bound  in  (G31)  as 
We alternately find the $\{q_t^k\}$ and $g(\Theta)$ that maximize the lower bound, keeping one fixed
while finding the other.

Keeping $g(\Theta)$ fixed, we solve $\max_{\{q_t^k\}} \mathrm{LB}(\{q_t^k\}, g(\Theta))$ subject to the normalization
constraint for $\{q_t^k\}$. We construct the Lagrangian
where $\lambda$ is the Lagrangian multiplier. Differentiating $\ell_q$ with respect to $q_t^k(z_{0:t})$ and setting
the result to zero, we obtain
which is solved to give
Using the constraint $\sum_{k=1}^{K} \sum_{t=0}^{T_k} \sum_{z_0, \ldots, z_t} q_t^k(z_{0:t}) = K$, we arrive at (US) with $e^{1 - \lambda K} = C_z$.

Keeping $\{q_t^k\}$ fixed, we solve $\max_{g(\Theta)} \mathrm{LB}(\{q_t^k\}, g(\Theta))$ subject to the normalization
constraint that $\int g(\Theta)\, d\Theta = 1$. Construct the Lagrangian

$$\ell_g = \mathrm{LB}(\{q_t^k\}, g(\Theta)) - \lambda \left(1 - \int g(\Theta)\, d\Theta\right) \qquad \text{(A-15)}$$

where $\lambda$ is the Lagrangian multiplier. Differentiating $\ell_g$ with respect to $g(\Theta)$ and setting
the result to zero, we obtain
which is solved to give
By using the constraint $\int g(\Theta)\, d\Theta = 1$, we arrive at (1771) with $e^{1 - \lambda} = C_\Theta$.
Q.E.D.